
Improving GC collections: dynamic thresholds, single generation gc and time barriers #100403

Open
pablogsal opened this issue Dec 21, 2022 · 29 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@pablogsal
Member

pablogsal commented Dec 21, 2022

In the pursuit of optimizing GC runs it has been observed that the weak generational hypothesis may not apply that well to Python. The argument is that, with a mixed cycle GC + refcount GC strategy, young objects are mostly cleaned up by reference counting, not by the cycle GC. It is also important that there is no segregation between the two GC strategies, so the cycle GC needs to deal with objects that, according to this argument, will mainly be cleaned up by reference counting alone.

This calls into question the utility of segregating the GC by generations, and indeed there is some evidence for this. I have been benchmarking the collection success rate of the different generations in some programs (such as black, mypy and a bunch of HTTP servers) and the success rate of the lower generations is generally small. Here is an example of running black over the whole standard library:

Statistics for generation 0

| Category | Value |
| --- | --- |
| count | 157917.000000 |
| mean | 1.192775 |
| std | 3.560391 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.480192 |
| max | 86.407768 |

Statistics for generation 1

| Category | Value |
| --- | --- |
| count | 14346.000000 |
| mean | 2.670852 |
| std | 11.642815 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.388794 |
| max | 97.406097 |

Statistics for generation 2

| Category | Value |
| --- | --- |
| count | 1280.000000 |
| mean | 45.698135 |
| std | 27.066735 |
| min | 0.000000 |
| 25% | 31.965862 |
| 50% | 54.618008 |
| 75% | 67.038842 |
| max | 90.592705 |

I am currently investigating whether having a single generation with a dynamic threshold, similar to the strategy we currently use for the last generation, would generally give better performance.

What do you think?


@pablogsal
Member Author

CC @tim-one @methane @nascheme

@pablogsal
Member Author

Here is a proof of concept for these changes if you want to play with it:

#100404

@methane
Member

methane commented Dec 21, 2022

I agree that the current threshold is not good for many applications.

But I still think generations are good for some cases.

Python creates a lot of cyclic objects that live until process shutdown: for example, modules, classes, functions, namespace dicts, and annotations. (PEP 563 eliminates most cycles from annotations, but it would be deprecated by PEP 649.)

For a large application using SQLAlchemy, a full GC takes more than 10 ms.
The younger generations can collect cycles created during web requests without tracing permanent objects.

Of course, the permanent generation can be used to keep a full GC from tracing permanent objects. But it requires manual GC tuning...

@pablogsal
Member Author

pablogsal commented Dec 21, 2022

For a large application using SQLAlchemy, a full GC takes more than 10 ms.
The younger generations can collect cycles created during web requests without tracing permanent objects.

Yeah, but the key is that it is not clear whether a single run of the last generation (or even a few more) is going to be better than a lot of inefficient runs of the lower generations plus a few runs of the full generation.

I am sure there are examples of how both ways can work, but I think we need to focus on the more common case, because otherwise we can always find workloads that give us reasons to prefer one approach or the other.

Do you have a benchmark of a SQLAlchemy application?

I agree that the current threshold is not good for many applications.

The other idea is to adjust the thresholds dynamically based on how effective a collection has been. For instance, we can set a target efficiency (defined as the number of objects collected divided by the size of the generation) and adjust the threshold for the next collection based on how far the actual collection deviates from the target.

For example, if the target is 25% and a collection collects 5%, we adjust the threshold by a scale factor f(5/25), making the generation threshold bigger; conversely, if we collect 75%, we reduce it by f(75/25).

What the scale factor should be is up for discussion.
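
To make the idea concrete, here is a minimal sketch in Python of one possible adjustment rule, assuming a clamped inverse-ratio scale factor and illustrative bounds (the actual f and limits are exactly what is up for discussion):

# Sketch only: the target, bounds and scale function below are illustrative
# assumptions, not values used by any PR.
TARGET_EFFICIENCY = 0.25            # aim: 25% of examined objects are collected
MIN_THRESHOLD, MAX_THRESHOLD = 700, 50_000

def scale_factor(ratio):
    # ratio = actual efficiency / target efficiency. Below 1.0 the collection
    # under-performed, so the threshold grows; above 1.0 it shrinks.
    return max(0.5, min(2.0, 1.0 / max(ratio, 1e-6)))

def next_threshold(current_threshold, collected, examined):
    efficiency = collected / examined if examined else 0.0
    scaled = current_threshold * scale_factor(efficiency / TARGET_EFFICIENCY)
    return int(min(MAX_THRESHOLD, max(MIN_THRESHOLD, scaled)))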

What do you think of this approach?

@pablogsal
Member Author

pablogsal commented Dec 21, 2022

Benchmark of #100404 (⚠️ please take into account that this is still a DRAFT and we may be missing something ⚠️)

All benchmarks:

Benchmark 2022-12-21_09-50-main-a7715ccfba5b 2022-12-21_14-11-gc_nogen-663a9654f714
async_tree_io 1.52 sec 658 ms: 2.31x faster
async_tree_memoization 753 ms 397 ms: 1.89x faster
async_tree_none 621 ms 334 ms: 1.86x faster
async_tree_cpu_io_mixed 924 ms 607 ms: 1.52x faster
xml_etree_iterparse 121 ms 98.7 ms: 1.23x faster
docutils 3.13 sec 2.68 sec: 1.17x faster
float 89.9 ms 77.0 ms: 1.17x faster
xml_etree_parse 173 ms 151 ms: 1.14x faster
unpack_sequence 56.3 ns 51.1 ns: 1.10x faster
deltablue 4.24 ms 3.98 ms: 1.07x faster
async_generators 374 ms 351 ms: 1.07x faster
logging_silent 127 ns 121 ns: 1.05x faster
genshi_xml 60.5 ms 57.6 ms: 1.05x faster
xml_etree_generate 105 ms 101 ms: 1.04x faster
regex_effbot 3.94 ms 3.79 ms: 1.04x faster
pickle_pure_python 364 us 352 us: 1.04x faster
2to3 310 ms 299 ms: 1.03x faster
python_startup 11.2 ms 10.9 ms: 1.03x faster
html5lib 72.5 ms 70.2 ms: 1.03x faster
generators 66.7 ms 64.8 ms: 1.03x faster
json_loads 31.9 us 31.0 us: 1.03x faster
regex_dna 227 ms 222 ms: 1.03x faster
pidigits 232 ms 226 ms: 1.02x faster
python_startup_no_site 8.65 ms 8.46 ms: 1.02x faster
sqlglot_parse 1.81 ms 1.77 ms: 1.02x faster
nbody 113 ms 110 ms: 1.02x faster
unpickle_pure_python 259 us 254 us: 1.02x faster
deepcopy_reduce 3.81 us 3.75 us: 1.02x faster
deepcopy 427 us 420 us: 1.02x faster
go 173 ms 170 ms: 1.02x faster
pyflate 501 ms 493 ms: 1.02x faster
genshi_text 26.3 ms 25.9 ms: 1.01x faster
sqlglot_transpile 2.15 ms 2.13 ms: 1.01x faster
logging_format 8.04 us 7.94 us: 1.01x faster
richards 53.7 ms 53.1 ms: 1.01x faster
coverage 545 ms 538 ms: 1.01x faster
sympy_expand 571 ms 564 ms: 1.01x faster
nqueens 103 ms 102 ms: 1.01x faster
xml_etree_process 72.6 ms 71.7 ms: 1.01x faster
deepcopy_memo 44.5 us 44.0 us: 1.01x faster
mako 12.1 ms 12.0 ms: 1.01x faster
telco 7.77 ms 7.70 ms: 1.01x faster
unpickle_list 5.97 us 5.92 us: 1.01x faster
logging_simple 7.28 us 7.24 us: 1.01x faster
mdp 3.37 sec 3.35 sec: 1.01x faster
sympy_integrate 25.5 ms 25.4 ms: 1.00x faster
raytrace 370 ms 371 ms: 1.00x slower
regex_v8 26.1 ms 26.2 ms: 1.00x slower
chameleon 8.42 ms 8.45 ms: 1.00x slower
hexiom 7.81 ms 7.84 ms: 1.00x slower
scimark_fft 380 ms 382 ms: 1.00x slower
bench_thread_pool 1.00 ms 1.01 ms: 1.01x slower
scimark_sor 138 ms 138 ms: 1.01x slower
sqlglot_normalize 135 ms 136 ms: 1.01x slower
scimark_monte_carlo 79.0 ms 79.6 ms: 1.01x slower
sqlglot_optimize 64.4 ms 64.9 ms: 1.01x slower
unpickle 16.2 us 16.4 us: 1.01x slower
json_dumps 12.1 ms 12.2 ms: 1.01x slower
pprint_pformat 1.77 sec 1.79 sec: 1.01x slower
regex_compile 162 ms 164 ms: 1.01x slower
scimark_sparse_mat_mult 4.64 ms 4.70 ms: 1.01x slower
crypto_pyaes 91.0 ms 92.1 ms: 1.01x slower
pprint_safe_repr 862 ms 877 ms: 1.02x slower
meteor_contest 127 ms 130 ms: 1.02x slower
sqlite_synth 2.52 us 2.57 us: 1.02x slower
chaos 82.4 ms 84.3 ms: 1.02x slower
fannkuch 451 ms 463 ms: 1.03x slower
django_template 41.9 ms 43.1 ms: 1.03x slower
spectral_norm 122 ms 126 ms: 1.03x slower
pickle_list 4.90 us 5.17 us: 1.06x slower
coroutines 33.9 ms 36.0 ms: 1.06x slower
pickle_dict 36.8 us 39.1 us: 1.06x slower
Geometric mean (ref) 1.05x faster

Benchmark hidden because not significant (6): scimark_lu, bench_mp_pool, sympy_sum, pickle, sympy_str, pathlib

@mdboom
Contributor

mdboom commented Dec 21, 2022

Here's the results on the Faster CPython benchmarking system (basically confirming the above plus or minus the usual noise): https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20221221-3.12.0a3+-663a965

@pablogsal changed the title from "Single generation GC" to "Improving GC collections: dynamic thresholds, single generation gc and time barriers" on Dec 22, 2022
@pablogsal
Member Author

pablogsal commented Dec 22, 2022

Here is another different idea: #100421

This PR implements time barriers in the GC. The main idea is to control how much of the total runtime the GC accounts for. To do that we keep a running average of the GC time and the total time since the GC last ran, and if the ratio between them is bigger than a specific threshold (I suppose this can be configurable) then we hold the GC back from running. The idea is that if the GC is running too much, we hold it off.

I think configuring this threshold may be much easier than changing the knobs of the generation thresholds, because it is much easier to understand what it is trying to achieve.
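
A rough sketch of that mechanism in Python, assuming an exponential moving average and an illustrative 2% time budget (the actual PR may use different bookkeeping):

import time

GC_TIME_BUDGET = 0.02          # assumed budget: GC may use at most 2% of runtime
_avg_gc_time = 0.0             # running average of GC pause duration (seconds)
_last_gc_end = time.monotonic()

def gc_allowed():
    # Hold the GC back while its average cost, relative to the time elapsed
    # since the last collection, would exceed the budget.
    elapsed = time.monotonic() - _last_gc_end
    return elapsed > 0 and (_avg_gc_time / elapsed) <= GC_TIME_BUDGET

def record_gc(duration):
    global _avg_gc_time, _last_gc_end
    _avg_gc_time = 0.9 * _avg_gc_time + 0.1 * duration
    _last_gc_end = time.monotonic()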

I think this is not going to be as exciting as the previous ideas, and the system call to get the time has the potential to slow down some use cases unless we combine it with other approaches.

@pablogsal
Member Author

pablogsal commented Dec 22, 2022

And here is yet another idea: #100422

This implements dynamic thresholds based on how successful a specific collection was. When we do a collection we calculate how many objects have been collected divided by how many objects were initially considered. Then we update the threshold for that generation based on the difference between the current ratio of collected to total objects and a given configurable target ratio. The update uses a sigmoid that saturates between 0 and 1 and is scaled between given max_threshold and min_threshold values.

This has the advantage over the first method that it keeps the generations.
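
A sketch of that update rule, assuming a logistic sigmoid and illustrative min/max bounds; see #100422 for the actual implementation:

import math

TARGET_RATIO = 0.20                       # assumed configurable target ratio
MIN_THRESHOLD, MAX_THRESHOLD = 700, 50_000
STEEPNESS = 10.0                          # how quickly the sigmoid saturates

def updated_threshold(collected, examined):
    ratio = collected / examined if examined else 0.0
    # Sigmoid of (target - actual): saturates in (0, 1) and is 0.5 when the
    # collection hits the target exactly. Ineffective collections (ratio below
    # the target) push the threshold towards MAX_THRESHOLD.
    s = 1.0 / (1.0 + math.exp(-STEEPNESS * (TARGET_RATIO - ratio)))
    return int(MIN_THRESHOLD + s * (MAX_THRESHOLD - MIN_THRESHOLD))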

Benchmarks for this idea:

All benchmarks:

Benchmark 2022-12-21_09-50-main-a7715ccfba5b 2022-12-22_00-36-gc_thresholds-a292d5ccb575
async_tree_io 1.52 sec 740 ms: 2.06x faster
async_tree_memoization 753 ms 442 ms: 1.70x faster
async_tree_none 621 ms 371 ms: 1.67x faster
async_tree_cpu_io_mixed 924 ms 647 ms: 1.43x faster
float 89.9 ms 80.9 ms: 1.11x faster
unpack_sequence 56.3 ns 51.8 ns: 1.09x faster
genshi_xml 60.5 ms 58.1 ms: 1.04x faster
xml_etree_parse 173 ms 167 ms: 1.03x faster
xml_etree_iterparse 121 ms 117 ms: 1.03x faster
regex_effbot 3.94 ms 3.83 ms: 1.03x faster
html5lib 72.5 ms 70.7 ms: 1.03x faster
deltablue 4.24 ms 4.14 ms: 1.02x faster
pidigits 232 ms 227 ms: 1.02x faster
coroutines 33.9 ms 33.2 ms: 1.02x faster
pathlib 19.5 ms 19.1 ms: 1.02x faster
async_generators 374 ms 366 ms: 1.02x faster
go 173 ms 170 ms: 1.02x faster
json_dumps 12.1 ms 11.9 ms: 1.02x faster
xml_etree_generate 105 ms 103 ms: 1.02x faster
telco 7.77 ms 7.62 ms: 1.02x faster
logging_silent 127 ns 125 ns: 1.02x faster
json_loads 31.9 us 31.3 us: 1.02x faster
scimark_lu 137 ms 135 ms: 1.02x faster
sqlglot_parse 1.81 ms 1.78 ms: 1.02x faster
xml_etree_process 72.6 ms 71.6 ms: 1.01x faster
meteor_contest 127 ms 126 ms: 1.01x faster
scimark_fft 380 ms 376 ms: 1.01x faster
sqlglot_normalize 135 ms 134 ms: 1.01x faster
chameleon 8.42 ms 8.35 ms: 1.01x faster
nqueens 103 ms 103 ms: 1.00x faster
regex_v8 26.1 ms 26.0 ms: 1.00x faster
regex_compile 162 ms 162 ms: 1.00x slower
hexiom 7.81 ms 7.83 ms: 1.00x slower
sqlglot_optimize 64.4 ms 64.5 ms: 1.00x slower
sympy_expand 571 ms 573 ms: 1.00x slower
bench_thread_pool 1.00 ms 1.00 ms: 1.00x slower
crypto_pyaes 91.0 ms 91.4 ms: 1.01x slower
pprint_pformat 1.77 sec 1.78 sec: 1.01x slower
unpickle_pure_python 259 us 261 us: 1.01x slower
fannkuch 451 ms 455 ms: 1.01x slower
scimark_sor 138 ms 139 ms: 1.01x slower
sympy_sum 205 ms 207 ms: 1.01x slower
logging_simple 7.28 us 7.35 us: 1.01x slower
pprint_safe_repr 862 ms 869 ms: 1.01x slower
django_template 41.9 ms 42.2 ms: 1.01x slower
sympy_str 353 ms 356 ms: 1.01x slower
sympy_integrate 25.5 ms 25.7 ms: 1.01x slower
bench_mp_pool 29.9 ms 30.2 ms: 1.01x slower
deepcopy 427 us 432 us: 1.01x slower
logging_format 8.04 us 8.13 us: 1.01x slower
unpickle 16.2 us 16.4 us: 1.01x slower
scimark_sparse_mat_mult 4.64 ms 4.70 ms: 1.01x slower
deepcopy_memo 44.5 us 45.1 us: 1.01x slower
chaos 82.4 ms 83.5 ms: 1.01x slower
python_startup 11.2 ms 11.4 ms: 1.01x slower
python_startup_no_site 8.65 ms 8.80 ms: 1.02x slower
spectral_norm 122 ms 124 ms: 1.02x slower
genshi_text 26.3 ms 26.9 ms: 1.02x slower
sqlite_synth 2.52 us 2.58 us: 1.02x slower
2to3 310 ms 319 ms: 1.03x slower
mako 12.1 ms 12.5 ms: 1.03x slower
pickle_list 4.90 us 5.09 us: 1.04x slower
unpickle_list 5.97 us 6.20 us: 1.04x slower
pickle_dict 36.8 us 38.6 us: 1.05x slower
regex_dna 227 ms 239 ms: 1.05x slower
Geometric mean (ref) 1.03x faster

This is how it compares to the first method:

All benchmarks:

Benchmark 2022-12-21_14-11-gc_nogen-663a9654f714 2022-12-22_00-36-gc_thresholds-a292d5ccb575
coroutines 36.0 ms 33.2 ms: 1.08x faster
meteor_contest 130 ms 126 ms: 1.03x faster
logging_silent 121 ns 125 ns: 1.03x slower
pickle_pure_python 352 us 364 us: 1.04x slower
genshi_text 25.9 ms 26.9 ms: 1.04x slower
python_startup_no_site 8.46 ms 8.80 ms: 1.04x slower
mako 12.0 ms 12.5 ms: 1.04x slower
deltablue 3.98 ms 4.14 ms: 1.04x slower
async_generators 351 ms 366 ms: 1.04x slower
unpickle_list 5.92 us 6.20 us: 1.05x slower
python_startup 10.9 ms 11.4 ms: 1.05x slower
float 77.0 ms 80.9 ms: 1.05x slower
2to3 299 ms 319 ms: 1.07x slower
async_tree_cpu_io_mixed 607 ms 647 ms: 1.07x slower
regex_dna 222 ms 239 ms: 1.08x slower
xml_etree_parse 151 ms 167 ms: 1.11x slower
async_tree_memoization 397 ms 442 ms: 1.11x slower
async_tree_none 334 ms 371 ms: 1.11x slower
async_tree_io 658 ms 740 ms: 1.12x slower
docutils 2.68 sec 3.08 sec: 1.15x slower
xml_etree_iterparse 98.7 ms 117 ms: 1.19x slower
Geometric mean (ref) 1.02x slower

Benchmark hidden because not significant (57): json_dumps, pathlib, django_template, fannkuch, pickle_list, sqlglot_normalize, scimark_fft, spectral_norm, pickle_dict, scimark_lu, chameleon, chaos, scimark_monte_carlo, telco, regex_compile, pprint_safe_repr, crypto_pyaes, regex_v8, raytrace, sqlglot_optimize, go, pprint_pformat, hexiom, xml_etree_process, bench_thread_pool, scimark_sparse_mat_mult, pickle, pidigits, sqlite_synth, scimark_sor, unpickle, sqlglot_parse, mdp, sympy_str, html5lib, nqueens, genshi_xml, json_loads, sympy_sum, sqlglot_transpile, regex_effbot, unpack_sequence, bench_mp_pool, sympy_integrate, deepcopy_reduce, logging_simple, sympy_expand, coverage, richards, nbody, pyflate, xml_etree_generate, logging_format, deepcopy_memo, unpickle_pure_python, generators, deepcopy

@nascheme
Member

I suspect it really depends on the application. It could be the case that some (many?) applications could completely turn off the cyclic GC and only rely on ref counting. However, I imagine there are applications that generate cyclic garbage quickly and would perform quite badly with a single generation GC. It takes quite a lot more time to do the full collection and the young generation is effective at collecting recently created cyclic garbage.

The dynamic threshold sounds like a good idea to me. The challenge will be to make something that works for all CPython users. We have something that reduces the number of full collections, from gcmodule.c:

            /* Avoid quadratic performance degradation in number
               of tracked objects (see also issue #4074):

               To limit the cost of garbage collection, there are two strategies;
                 - make each collection faster, e.g. by scanning fewer objects
                 - do less collections
               This heuristic is about the latter strategy.

               In addition to the various configurable thresholds, we only trigger a
               full collection if the ratio

                long_lived_pending / long_lived_total

               is above a given value (hardwired to 25%).

               The reason is that, while "non-full" collections (i.e., collections of
               the young and middle generations) will always examine roughly the same
               number of objects -- determined by the aforementioned thresholds --,
               the cost of a full collection is proportional to the total number of
               long-lived objects, which is virtually unbounded.

               Indeed, it has been remarked that doing a full collection every
               <constant number> of object creations entails a dramatic performance
               degradation in workloads which consist in creating and storing lots of
               long-lived objects (e.g. building a large list of GC-tracked objects would
               show quadratic performance, instead of linear as expected: see issue #4074).

               Using the above ratio, instead, yields amortized linear performance in
               the total number of objects (the effect of which can be summarized
               thusly: "each full garbage collection is more and more costly as the
               number of objects grows, but we do fewer and fewer of them").

               This heuristic was suggested by Martin von Löwis on python-dev in
               June 2008. His original analysis and proposal can be found at:
               http://mail.python.org/pipermail/python-dev/2008-June/080579.html
            */

@pablogsal
Member Author

pablogsal commented Dec 23, 2022

The dynamic threshold sounds like a good idea to me. The challenge will be to make something that works for all CPython users. We have something that reduces the number of full collections, from gcmodule.c:

What do you think of the approach of

#100403 (comment)

I think this could be interesting because it is easy to understand how it changes the thresholds dynamically, and it is also easy to understand what the knobs are trying to achieve in the GC behaviour.

Also, it had good performance numbers in the benchmark suite.

@gpshead
Member

gpshead commented Dec 24, 2022

@nanjekyejoannah as well (who is also interested in GC if I recall correctly?)

@nascheme
Member

nascheme commented Jan 3, 2023

Benchmarking real applications is good, but I think we should also run some simulations or do some analysis to estimate how the GC will behave under different scenarios. To that end, I started writing a simulator. It is still crude and likely buggy.

To help with simulation and modelling, some profiling would help to determine the cost of minor and major GC runs. My first guess is that the cost is roughly O(n) where n is the number of objects in the generation. If that's correct, I don't see how the dynamic threshold change will help, at least how it is written. Each newly created object gets examined once in the young generation, so doing it less often doesn't save any time; it just makes the gen 0 collection take longer. That's another thing to be concerned about: GC pause time. Making gen 0 and gen 1 large will increase GC pause time. It is a "stop the world" GC and some applications might not be happy with much longer pauses.

In my basic simulator, I set the fraction of objects that are trash. If that's roughly how things work, I don't think the threshold algorithm in gh-100422 is what we want. Say 2% of newly created objects immediately become cyclic trash. Since it is targeting 20% of the objects in gen 0 as trash, it will keep increasing the threshold until it reaches the upper limit. Making the generation larger does not increase the % of objects that are trash and I don't see how that's likely in any scenario. Perhaps the threshold needs to be based on something like (number of trash objects found / number of total live objects). However, that approach would result in GC pauses becoming long as the total number of live objects gets large.
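
For illustration, here is a toy version of that kind of simulation (a sketch only; nascheme's actual simulator is separate and more detailed), which shows why a fixed trash fraction makes the collected ratio insensitive to the threshold:

import random

TRASH_FRACTION = 0.02   # assumption: 2% of new objects immediately become cyclic trash

def simulate(allocations, threshold):
    """Return the number of gen0 runs and the per-run collected ratio."""
    runs, ratios = 0, []
    live_in_gen0, trash_in_gen0 = 0, 0
    for _ in range(allocations):
        live_in_gen0 += 1
        if random.random() < TRASH_FRACTION:
            trash_in_gen0 += 1
        if live_in_gen0 >= threshold:
            runs += 1
            ratios.append(trash_in_gen0 / live_in_gen0)
            live_in_gen0, trash_in_gen0 = 0, 0   # survivors would be promoted
    return runs, ratios

for threshold in (700, 7000, 50000):
    runs, ratios = simulate(1_000_000, threshold)
    print(threshold, runs, sum(ratios) / len(ratios))   # ratio stays ~2% regardless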

BTW, it looks to me like gh-100422 sets thresholds for older generations so high that collections basically don't happen. Those thresholds are based on the number of collections, not number of objects added to the generation.

Some ideas on how to make progress:

  • add more GC metrics, enabled by a -X option or gc.set_debug(). E.g. time spent in different parts of the GC, number of objects surviving each generation. Use this to collect data from real applications and benchmarks
  • improve the simulator to match what the GC does, output metrics after simulation run
  • define objectives for optimization, trade-offs are: time spent in GC, length of GC pauses, overall memory use (fraction of trash vs total live objects).
  • design tuning knobs or automatic tuning that optimizes those objectives

Regarding your "Statistics" tables, those are interesting but it's not exactly clear to me what the numbers mean. Is the "count" directly from the GC generation count field? If so, I think gen 0 count is based on the number of objects in the generation whereas the gen 1 and gen 2 are based on number of collections. And I guess the rest of the table rows are based on the number of trash objects found. If that's all correct, gen 0 is not so bad as you think at finding trash whereas gen 1 is pretty useless. Maybe we should only have two generations. If would be interesting to see statistisc based on the fraction of objects found to be trash and the number of objects in each generation. Also, the amount of time for the GC vs size of generation. We might deduce the O(...) behaviour from that.

@pablogsal
Member Author

pablogsal commented Jan 3, 2023

To help with simulation and modelling, some profiling would help to determine the cost of minor and major GC runs. My first guess is that the cost is roughly O(n) where n is the number of objects in the generation. If that's correct, I don't see how the dynamic threshold change will help, at least how it is written.

But under that assumption, if you run simulations that have a constant percentage of trash per collection, the size of the generation doesn't matter: the only thing that determines the time spent in GC over a fixed period is the total number of objects that the GC has seen, and unless you add more assumptions, that should not change.

On the other hand, I think the key here is that having fewer GC collections allows refcounting to kill most of the stuff that's not cyclic trash and therefore the collections would be more efficient. I think this is the key aspect that can optimize the GC time in general and you won't be able to capture this with a simulation that simulates fixed percentages.

Regarding your "Statistics" tables, those are interesting but it's not exactly clear to me what the numbers mean. Is the "count" directly from the GC generation count field? If so, I think gen 0 count is based on the number of objects in the generation whereas the gen 1 and gen 2 are based on number of collections. And I guess the rest of the table rows are based on the number of trash objects found. If that's all correct, gen 0 is not so bad as you think at finding trash whereas gen 1 is pretty useless. Maybe we should only have two generations. If would be interesting to see statistisc based on the fraction of objects found to be trash and the number of objects in each generation. Also, the amount of time for the GC vs size of generation. We might deduce the O(...) behaviour from that.

Apologies for the lack of clarity on the tables. "Count" is just the number of data points, but that value alone doesn't tell you much so I would suggest ignoring it. The rest are the % of collected objects over the number of objects available at the start of a collection of that generation. As this is done across many different runs, you get a distribution. For example:

| Category | Value |
| --- | --- |
| mean | 1.192775 |
| std | 3.560391 |
| min | 0.000000 |
| 25% | 0.000000 |
| 50% | 0.000000 |
| 75% | 0.480192 |
| max | 86.407768 |

This means that on average a collection of this generation collects 1% of the objects in the generation, with a standard deviation of 3.5%. The 25/75 rows are the quantiles of the distribution, and min and max are the efficiency of the worst and best collections. In this case the worst collection collected 0% of the objects and the best collection collected 86% of the objects.
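
For anyone who wants to reproduce roughly this kind of data without an instrumented build, something similar can be approximated from pure Python with gc.callbacks (only a sketch, and it adds noticeable overhead because it sizes the generations with gc.get_objects()):

import gc
from collections import defaultdict

efficiency = defaultdict(list)   # generation -> % of examined objects collected
_examined = {}

def _gc_stats(phase, info):
    gen = info["generation"]
    if phase == "start":
        # Approximate how many objects this run will examine: everything in
        # the collected generation plus the younger ones merged into it.
        _examined[gen] = sum(len(gc.get_objects(generation=g)) for g in range(gen + 1))
    else:  # phase == "stop"
        examined = _examined.pop(gen, 0) or 1
        efficiency[gen].append(100.0 * info["collected"] / examined)

gc.callbacks.append(_gc_stats)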

BTW, it looks to me like gh-100422 sets thresholds for older generations so high that collections basically don't happen. Those thresholds are based on the number of collections, not number of objects added to the generation.

Yeah, I suspect that most of the improvements come from making the generations bigger. That's another question: maybe the generations we have currently are too small for many applications. Maybe just changing the numbers would help a bit. Coming up with "good numbers" is, I suppose, going to be the most challenging part, but I cannot see any proof that the status quo is maximizing any specific objective. Does anyone know where the numbers came from?

However, that approach would result in GC pauses becoming long as the total number of live objects gets large.

Yeah, although the idea is that over the long run, the actual total time of GC should be less. I assume these are two different axes to optimize, depending on your application.

It would be interesting to see statistics based on the fraction of objects found to be trash and the number of objects in each generation.

I may be missing something but those are the statistics that I showed: it is the % of collected trash objects over the objects in that generation.

@nascheme
Member

nascheme commented Jan 3, 2023

On the other hand, I think the key here is that having fewer GC collections
allows refcounting to kill most of the stuff that's not cyclic trash and
therefore the collections would be more efficient. I think this is the key
aspect that can optimize the GC time in general and you won't be able to
capture this with a simulation that simulates fixed percentages.

Ah, that's a good point and would explain how your change could improve
performance. If gen 0 is too small, the GC looks at a lot of objects that die
due to ref counting. That's wasted time.

It would be interesting to gather some data about that, i.e. when an object
dies, which GC generation it is in. We might need a special build that adds a field
to the GC head to keep track of the current generation. Otherwise it is too
expensive to compute.

The rest are the % of collected objects over the number of objects available
at the start of a collection of that generation.

Oh, I totally misunderstood then. That changes things a lot. The distribution
is highly skewed so "mean" is likely pretty useless. It could be useful to see
higher percentiles, e.g. 90, 95, 99%.

So gen 0 and 1 are finding hardly any trash since the 75th percentile is less
than 0.5%.

The gen 2 numbers are surprising too, the 50th percentile is 54.6%? That seems
really high to me. With those numbers, having only a single generation is
probably better.

Does anyone know where the numbers came from?

Originally I just picked some thresholds without too much thought. There was
some further tweaking of them but, AFAIK, no one ever did a lot of analysis to
determine what they should be.

[...] those are the statistics that I showed: it is the % of collected trash
objects over the objects in that generation.

Right, I misunderstood. Those numbers are useful and surprising, at least to
me.

@pablogsal
Member Author

pablogsal commented Jan 3, 2023

It would be interesting to gather some data about that, i.e. when an object
dies, which GC generation it is in. We might need a special build that adds a field
to the GC head to keep track of the current generation. Otherwise it is too
expensive to compute.

Oh, fantastic suggestion. Let me play with this a bit and I will maybe propose a PR.

Right, I misunderstood. Those numbers are useful and surprising, at least to
me.

Let me try to gather them again and point you to a branch to reproduce them so we can play with different workloads. Maybe we could also add a configure flag or something to have this in the interpreter itself.

@carljm
Member

carljm commented Jan 4, 2023

If we develop a statistics-gathering patch that's been well-tested and we're confident it collects useful data, assuming I can reasonably backport it to Python 3.8 or 3.10, I should be able to run it on an IG prod host and collect stats from our workload.

pablogsal added a commit to pablogsal/cpython that referenced this issue Jan 11, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue Jan 12, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue Jan 12, 2023
pablogsal added a commit to pablogsal/cpython that referenced this issue Jan 12, 2023
…en --enable-pystats is provided

Signed-off-by: Pablo Galindo <pablogsal@gmail.com>
nascheme added a commit to nascheme/cpython that referenced this issue Jan 12, 2023
Port of Pablo's pythonGH-100403 to Python 3.11.x.
@nascheme
Member

I made a small change to allow the threshold to be set by an env var:

diff --git a/Modules/gcmodule.c b/Modules/gcmodule.c
index 829d6d9d004..2d8e22a795f 100644
--- a/Modules/gcmodule.c
+++ b/Modules/gcmodule.c
@@ -195,8 +195,15 @@ _PyGC_InitState(GCState *gcstate)
     };
     gcstate->generation0 = GEN_HEAD(gcstate, 0);
     INIT_HEAD(gcstate->permanent_generation);
-
 #undef INIT_HEAD
+
+    char *s = Py_GETENV("PYTHONGCTHRESHOLD");
+    if (s) {
+        long n = atol(s);
+        if (n > 0) {
+            gcstate->generations[0].threshold = n;
+        }
+    }
 }
 
 

I gathered stats for a web application and for the Django test suite, at 700 and 7000 for the GC threshold. The black application doesn't seem to create too much cyclic garbage while it runs so I expect the thresholds could be set high for it.

I added a new chart, "Collected per run time", which is "objects collected" / "running time", e.g. objects collected per ms of collection time.

Web app @ threshold 700:
[chart: webapp threshold=700]

Web app @ threshold 7000:
[chart: webapp threshold=7000]

Using a threshold of 7000 makes the youngest collection much more effective: >600 objects collected per ms of GC run time vs <200.

Some investigation shows that the HTML form framework used by the web app creates some reference cycles. So each generated web page that contains HTML forms can create work for the GC. Black creates some cyclic garbage too (NFAState, LineGenerator, Leaf, Node) but not very much.

Django test suite @ 700:
[chart: django tests threshold=700]

Django test suite @ 7000:
[chart: django tests threshold=7000]

Unlike the web app, Django doesn't benefit as much from using 7000 as the threshold. Total GC run time goes down, which is good.

I would like to do more testing and analysis. My initial impression is that 700 is too small for the threshold, especially for apps that don't create a lot of cyclic trash. Also, it seems we do a bunch of first-gen collections before the interpreter finishes initializing. That seems like some low-hanging fruit to claim, e.g. disable automatic GC until startup is done; we are likely not creating a lot of cyclic garbage then.

Again, I think we need to think more about what our objectives are. Some possible ones: minimum runtime (i.e. max throughput), minimizing GC pauses (min latency), minimizing memory usage (minimizing the time cyclic trash stays alive). With the threshold at 700, first generation collections seem to take a fraction of 1 ms, small enough that I doubt anyone notices them. If we increase the size so it takes 10 ms, people may notice. E.g. 100 FPS is 10 ms per frame.

@nascheme
Member

Some more testing, trying different thresholds. Script to generate:
https://gist.github.com/nascheme/8652a77af79837a2cd6e24e88b457fc6

| threshold | gen | num collections | mean collected cycles | mean time [ns] | mean total objects | total time [s] | time per total [ns/obj] | time per collected [ns/obj] | ratio collected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | 0 | 16797 | 35 | 49,581 | 736 | 0.83 | 67 | 1415 | 5% |
| 500 | 1 | 1526 | 1,002 | 616,292 | 4,832 | 0.94 | 128 | 615 | 21% |
| 500 | 2 | 109 | 21,842 | 69,297,688 | 213,383 | 7.55 | 325 | 3173 | 10% |
| 5000 | 0 | 1039 | 1,793 | 907,897 | 5,701 | 0.94 | 159 | 506 | 31% |
| 5000 | 1 | 94 | 21,532 | 10,784,989 | 34,436 | 1.01 | 313 | 501 | 63% |
| 5000 | 2 | 12 | 60,610 | 99,815,000 | 255,447 | 1.20 | 391 | 1647 | 24% |
| 50000 | 0 | 101 | 35,198 | 17,684,436 | 49,616 | 1.79 | 356 | 502 | 71% |
| 50000 | 1 | 9 | 105,951 | 87,971,778 | 153,789 | 0.79 | 572 | 830 | 69% |
| 50000 | 2 | 4 | 49,514 | 90,719,000 | 213,088 | 0.36 | 426 | 1832 | 23% |

I'm not sure why the "time per total" number goes up when the threshold is higher; I would expect some efficiency gain with a larger number of objects. For this app, I would say 5000 is near the sweet spot. With a threshold of 50,000, the mean time to collect gen0 is 17 ms vs 0.9 ms and "time per collected object" is only slightly better.

@pablogsal
Member Author

pablogsal commented Jan 18, 2023

Some more stats based on this pygame-based Super Mario simulator:

https://github.com/mx0c/super-mario-python

Base

[chart: base]
Generation 0: 797 objects collected, 0.02 s, 86534 total objects
Generation 1: 433 objects collected, 0.01 s, 50447 total objects
Generation 2: 32132 objects collected, 0.04 s, 136712 total objects
Total GC time: 0.07 s

1st gen @ 7000

[chart: 1st gen @ 7000]

Generation 0: 803 objects collected, 0.01 s, 74589 total objects
Generation 1: 333 objects collected, 0.01 s, 48351 total objects
Generation 2: 32271 objects collected, 0.04 s, 136765 total objects
Total GC time: 0.06 s

It seems that this application is largely unaffected by the change, modulo some outliers in the first gen.

@nascheme
Member

A few more thoughts about this. The 700 threshold value was set long ago. The commit that set the value:

commit 3263dc2b1538ddd948bb60a5ccee817e512a3671
Author: Jeremy Hylton <jeremy@alum.mit.edu>
Date:   Tue Sep 5 15:44:50 2000 +0000

    compromise value for threshold0: not too high, not too low

diff --git a/Modules/gcmodule.c b/Modules/gcmodule.c
index 28263c07228..06f51ed6cd7 100644
--- a/Modules/gcmodule.c
+++ b/Modules/gcmodule.c
@@ -40,7 +40,7 @@ static int generation = 0; /* current generation being collected */
 
 /* collection frequencies, XXX tune these */
 static int enabled = 1; /* automatic collection enabled? */
-static int threshold0 = 5000; /* net new containers before collection */
+static int threshold0 = 700; /* net new containers before collection */
 static int threshold1 = 10;  /* generation0 collections before collecting 1 */
 static int threshold2 = 10;  /* generation1 collections before collecting 2 */
 

At that time, only a few "container objects" had GC support (e.g. lists, tuples, dicts). Since then, many more types have gained support for cyclic GC, which makes collections happen more often than they originally did. Also, computer memory has expanded significantly in 20 years. So, it seems likely 700 is too small as a default.

The fact that threshold1 and threshold2 are based on #-of-collections rather than #-of-objects makes tuning more complex. It would be nice to base them on objects as well, but when an object is freed, we don't know which generation it is in. We always subtract one from the gen0 count.
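
For context, this asymmetry is already visible from the gc module:

import gc

# Defaults: (700, 10, 10). The first number counts net new tracked objects;
# the second and third count *collections* of the previous generation, not objects.
print(gc.get_threshold())

# Raising only the gen0 object-count threshold, as in the experiments above:
gc.set_threshold(7000, 10, 10)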

Regarding the dynamic tuning of the threshold, what are the reasons to not just always use the high threshold limit? You could argue that you are using more memory that could be freed if you collect more often. However, if you are creating a lot of cyclic garbage, the threshold will be reached sooner and maybe a high threshold is better.

There is a danger with a high threshold that a lot of memory can be used by cyclic trash that is not accounted for by tracking #-of-objects. E.g. you have some large array objects. Then, a higher GC threshold will make your program use much more memory.

It would be nice if we could track total memory use rather than #-of-objects as a threshold. I tried adding this function:

size_t
_PyMem_GetArenaBytes(void)
{
    if (!_PyMem_PymallocEnabled()) {
        return 0;
    }
    /* # of arenas actually allocated. */
    size_t narenas = 0;
    for (uint i = 0; i < maxarenas; ++i) {
        /* Skip arenas which are not allocated. */
        if (arenas[i].address == (uintptr_t)NULL) {
            continue;
        }
        narenas += 1;
    }
    return narenas * ARENA_SIZE;
}

And then changing GC triggering to check both #-of-objects threshold and also the incremental size of _PyMem_GetArenaBytes(). Counting arenas is faster than looking at pools and probably good enough. It doesn't really work though because this doesn't account for allocations larger than the obmalloc max size. I tried adding memory tracking for _PyObject_Malloc, _PyObject_Calloc, _PyObject_Free. The accounting is a bit tricky. If we switched to mimalloc, we could track total memory use more easily, I think. Or, we could use OS APIs like what's used by Modules/resource.c, e.g. to get data-resident-size. Should we be looking at other resources, like file descriptors? Since we strongly encourage using context handlers, maybe not.

Even if we can track memory used by PyMem/PyObject_Malloc, it is still possible extensions allocate memory using their own malloc and we wouldn't see that. So, maybe using an OS API to find the process memory size is better. On platforms with OS support, we could base the GC collection on both #-of-objects and increase in process memory size. On other platforms, use just #-of-objects.

@pablogsal
Member Author

pablogsal commented Jan 19, 2023

Regarding the dynamic tuning of the threshold, what are the reasons to not just always use the high threshold limit? You could argue that you are using more memory that could be freed if you collect more often. However, if you are creating a lot of cyclic garbage, the threshold will be reached sooner and maybe a high threshold is better.

The idea was that we don't want to impact latency too much if there are a lot of cyclic objects. If we just grow the threshold and there are A LOT of cycles, then the latency will grow unbounded. In those cases we want to back off a bit and do more small collections.

Maybe there are better metrics to make "dynamic" than the one I am using there, though.

And then changing GC triggering to check both #-of-objects threshold and also the incremental size of _PyMem_GetArenaBytes(). Counting arenas is faster than looking at pools and probably good enough. It doesn't really work though because this doesn't account for allocations larger than the obmalloc max size. I tried adding memory tracking for _PyObject_Malloc, _PyObject_Calloc, _PyObject_Free. The accounting is a bit tricky. If we switched to mimalloc, we could track total memory use more easily, I think. Or, we could use OS APIs like what's used by Modules/resource.c, e.g. to get data-resident-size. Should we be looking at other resources, like file descriptors? Since we strongly encourage using context handlers, maybe not. Even if we can track memory used by PyMem/PyObject_Malloc, it is still possible extensions allocate memory using their own malloc and we wouldn't see that. So, maybe using an OS API to find the process memory size is better. On platforms with OS support, we could base the GC collection on both #-of-objects and increase in process memory size. On other platforms, use just #-of-objects.

This is interesting but still has some challenges. One of the problems is that unless we only look at pools that are assigned to GC objects, we can make huge mistakes when estimating size. The OS APIs report resident size, but that can be deceiving for many reasons and it is also very difficult to transform into a number that represents something the GC can do anything about.

For example, I can imagine a situation where someone (or some C extension) mmapped a huge chunk of memory and that causes the GC to run in crazy mode because it thinks that it is able to clean some of that memory when maybe that is not possible. The same problem happens with arenas or the PyMem/PyObject_Malloc APIs: we should only account for the memory of GC objects, because someone could have a gigantic string and then the GC would again run in crazy mode.

It seems that mimalloc could help us if we force everyone to use the Python memory APIs and we segment allocations into GC and non-GC objects, but that has a lot of other problems.

Another approach is running getsizeof() on every object that is added to the GC, but that can slow down everything by a lot and also can be inaccurate in some cases.

@nascheme
Member

The idea was that we don't want to impact latency too much if there are a lot of cyclic objects. If we just grow the threshold and there are A LOT of cycles, then the latency will grow unbounded. In those cases we want to back off a bit and do more small collections.

I don't mean to increase the threshold without bound, so the latency shouldn't be unbounded. Just set the threshold to the high value and keep it there, was my idea. If a higher threshold works when there are not a lot of trash cycles, why doesn't it also work when there are many cycles?

For example, I can imagine a situation where someone (or some C extension) mmapped a huge chunk of memory and that causes the GC to run in crazy mode because it thinks that it is able to clean some of that memory when maybe that is not possible. The same problem happens with arenas or the PyMem/PyObject_Malloc APIs: we should only account for the memory of GC objects, because someone could have a gigantic string and then the GC would again run in crazy mode.

I was thinking of using getrusage() and ru_maxrss, checking the incremental increase in RSS. So once the GC runs, it wouldn't run again until RSS increased by some amount (I was using 5 MB). Also, my condition for running the GC was both the object count and the increase in RSS.

Something like:

--- a/Modules/gcmodule.c
+++ b/Modules/gcmodule.c
@@ -180,6 +180,10 @@ _PyInitGCStats() {
 }
 #endif
 
+/* likely should be some multiple larger than obmalloc ARENA_SIZE */
+static size_t pymem_size_threshold = 5000000;
+static size_t pymem_size_last = 0;
+
 void
 _PyGC_InitState(GCState *gcstate)
 {
@@ -2435,6 +2450,9 @@ PyObject_IS_GC(PyObject *obj)
     return _PyObject_IS_GC(obj);
 }
 
+// implemented in obmalloc.c, estimate RSS size of process
+extern size_t _PyMem_GetTotalUsedBytes(void);
+
 void
 _PyObject_GC_Link(PyObject *op)
 {
@@ -2452,9 +2470,19 @@ _PyObject_GC_Link(PyObject *op)
         !gcstate->collecting &&
         !_PyErr_Occurred(tstate))
     {
-        gcstate->collecting = 1;
-        gc_collect_generations(tstate);
-        gcstate->collecting = 0;
+        size_t pymem_size = _PyMem_GetTotalUsedBytes();
+        size_t pymem_inc = pymem_size - pymem_size_last;
+        if (pymem_inc > pymem_size_threshold) {
+            gcstate->collecting = 1;
+            gc_collect_generations(tstate);
+            gcstate->collecting = 0;
+            pymem_size_last = _PyMem_GetTotalUsedBytes();
+        } else {
+            gcstate->generations[0].count = 0;
+        }
     }
 }

Another approach is running getsizeof() on every object that is added to the GC, but that can slow down everything by a lot and also can be inaccurate in some cases.

That sounds way too expensive to me. You could walk the obmalloc pools and add up what's used that way (you know the size class for each pool). Then have a way of tracking the sizes of large objects. Still sounds too slow to me.

@pablogsal
Member Author

I was thinking of using getrusage() and ru_maxrss, checking the incremental increase in RSS. So once the GC runs, it wouldn't run again until RSS increased by some amount (I was using 5 MB). Also, my condition for running the GC was both the object count and the increase in RSS.

But ru_maxrss will spike for a lot of unrelated reasons, no? Opening files, dlopening shared objects, shared libraries allocating big memory pools...etc

@gpshead
Member

gpshead commented Jan 20, 2023

... other non-Python threads doing useful things... etc.

@gpshead
Member

gpshead commented Jan 20, 2023

Fundamentally there cannot be a fixed threshold that is perfect. Every application has a different data structure shape, and any static value is a compromise no matter when it is chosen or what it is set to. That doesn't mean changing it is a bad idea.

Similarly, any dynamic value is probably prone to an antagonistic load that leads to it doing worse in some situations. Again, a compromise. This is why we have gc module APIs: so that applications that ever actually get to the point of identifying a need to improve their Python GC behavior, rather than lower-hanging fruit in their stack, have some metrics to read and dials to twist.

I suggest focusing first on exposing more detailed metrics from the GC than our existing get_stats() and get_count() APIs offer, as new public gc module APIs, so that some history of time spent doing what and when can be presented.

For example: is there a way to have --enable-pystats-like data and more always enabled without a notable performance hit? Or at least something that can be enabled by an API call or environment variable rather than needing to recompile with a configure #define?
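
Until such APIs exist, a rough approximation of that history can already be gathered from Python with gc.callbacks and a monotonic clock (a sketch; the names here are illustrative, not a proposed API):

import gc
import time

gc_history = []   # (generation, duration in ns, collected, uncollectable)
_started = {}

def _gc_timing(phase, info):
    gen = info["generation"]
    if phase == "start":
        _started[gen] = time.perf_counter_ns()
    else:  # phase == "stop"
        duration = time.perf_counter_ns() - _started.pop(gen, time.perf_counter_ns())
        gc_history.append((gen, duration, info["collected"], info["uncollectable"]))

gc.callbacks.append(_gc_timing)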

@nascheme
Member

I made a quick-and-dirty branch to test my idea of also checking for an increase in process memory use. I decided that the simplest thing to do with malloc()/calloc()/realloc()/free() is to just count the bytes allocated and not deduct on free(). I'm using the allocated bytes as an additional condition for triggering GC and so we never do any more collections, only fewer. Estimating what's used by obmalloc is easier, just count up used arenas.

Branch:
https://github.com/nascheme/cpython/tree/gc_estimate_process_size

A re-run of my web app benchmark, similar to the table in my comment above. Notice that for the 500 object threshold (first row), the average size of gen0 is 5,206. That means that the gen0 collection is being skipped about 9 out of 10 times because the memory estimate returned by _PyMem_EstimateTotalBytes didn't increase enough. I'm using 5 MB as the memory increase threshold.

An idea would be to add gc.set_memory_threshold() and set it to something (e.g. few megabytes) by default. Setting it to 0 would revert to the current Python GC behaviour, only considering an increase in object count in gen0. The logic is that if the value returned by _PyMem_EstimateTotalBytes is not going up too much, we can defer collection of gen0 and hope that reference counting frees that memory. The risk is we are not counting resource usage not captured by _PyMem_EstimateTotalBytes. However, this approach seems less risky to me compared with just increasing the object count threshold.

| threshold | gen | num collections | mean collected cycles | mean time [ns] | mean total objects | total time [s] | time per total [ns/obj] | time per collected [ns/obj] | ratio collected |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 500 | 0 | 1176 | 1,398 | 802,900 | 5,206 | 0.94 | 154 | 574 | 27% |
| 500 | 1 | 106 | 21,115 | 9,826,632 | 30,416 | 1.04 | 323 | 465 | 69% |
| 500 | 2 | 13 | 37,056 | 82,745,154 | 229,967 | 1.08 | 360 | 2233 | 16% |
| 5000 | 0 | 692 | 2,790 | 1,403,973 | 7,847 | 0.97 | 179 | 503 | 36% |
| 5000 | 1 | 62 | 32,909 | 16,666,484 | 45,656 | 1.03 | 365 | 506 | 72% |
| 5000 | 2 | 9 | 45,681 | 89,475,778 | 234,440 | 0.81 | 382 | 1959 | 19% |
| 50000 | 0 | 96 | 35,462 | 17,017,979 | 49,757 | 1.63 | 342 | 480 | 71% |
| 50000 | 1 | 8 | 102,482 | 81,490,750 | 151,906 | 0.65 | 536 | 795 | 67% |
| 50000 | 2 | 4 | 57,971 | 95,446,250 | 217,941 | 0.38 | 438 | 1646 | 27% |

@nascheme
Member

Is there a way to have --enable-pystats-like data and more always enabled without a notable performance hit? Or at least something that can be enabled by an API call or environment variable rather than needing to recompile with a configure #define?

I think so. Some stuff that Pablo's stats branch collects might be too expensive. However, I think we could collect/report some more useful things without a lot of extra overhead. Then we can have a gc.set_debug() or -X command line flag to turn it on. Right now we are flying a bit blind in terms of how real apps behave.

@pablogsal
Member Author

I can modify my PR to do that in addition to the pystats collection.

What things do you think would be useful to have in the regular builds that would not be too prohibitive?

@pablogsal
Member Author

I have modified #100958 to be activated via a -X option.
