Improving GC collections: dynamic thresholds, single generation gc and time barriers #100403
Comments
Here is a proof of concept for these changes if you want to play with it:
I agree that the current threshold is not good for many applications. But I still think generations are good for some cases. Python creates a lot of cyclic objects that live until process shutdown: for example, modules, classes, functions, namespace dicts, and annotations. (PEP 563 eliminates most annotation cycles, but it would be deprecated by PEP 649.) For a large application using SQLAlchemy, a full GC takes more than 10ms. Of course, the permanent generation can be used to keep the full GC from tracing permanent objects, but that requires manual GC tuning...
Yeah, but the key is that it is not clear whether a single run of the last generation (or even a few more) is going to be better than a lot of inefficient runs of the lower generations plus a few runs of the full generation. I am sure there are examples of how both ways can work, but I think we need to focus on something that is more common, because otherwise we can always find workloads that give us reasons to pick one approach or the other. Do you have a benchmark of a SQLAlchemy application?
The other idea is to adjust the thresholds dynamically based on how effective a collection has been. For instance, we can set a target for efficiency (defined as how many objects are collected divided by the size of the generation) and adjust the threshold for the next collection based on how far the actual collection deviates from the target. For example, if the target is 25% and a collection collects 5%, we adjust the threshold with a scale factor f(5/25), making the generation threshold bigger; conversely, if we collect 75%, we reduce it by f(75/25). What the scale factor should be is up for discussion. What do you think of this approach?
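A minimal sketch of what such an adjustment could look like, assuming a simple reciprocal scale factor and made-up clamping bounds; the names and constants are illustrative and not CPython's implementation:

```python
# Illustrative only: adjust a generation's threshold based on how far the
# last collection's efficiency was from a target efficiency.
TARGET_EFFICIENCY = 0.25                      # aim for 25% of the generation being trash
MIN_THRESHOLD, MAX_THRESHOLD = 700, 100_000   # assumed clamping bounds

def adjust_threshold(threshold: int, collected: int, generation_size: int) -> int:
    if generation_size == 0:
        return threshold
    efficiency = collected / generation_size
    # One possible f(actual/target): the reciprocal.  Below-target efficiency
    # grows the threshold (collect less often); above-target shrinks it.
    scale = TARGET_EFFICIENCY / max(efficiency, 1e-6)
    scale = min(max(scale, 0.5), 2.0)          # damp each individual adjustment
    return int(min(max(threshold * scale, MIN_THRESHOLD), MAX_THRESHOLD))
```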
Benchmark of #100404 (:warning: please take into account that this is still a DRAFT and we may be missing something). All benchmarks:
Benchmark hidden because not significant (6): scimark_lu, bench_mp_pool, sympy_sum, pickle, sympy_str, pathlib
Here's the results on the Faster CPython benchmarking system (basically confirming the above, plus or minus the usual noise): https://github.com/faster-cpython/benchmarking-public/tree/main/results/bm-20221221-3.12.0a3+-663a965
Here is another, different idea: #100421. This PR implements time barriers in the GC. The main idea is to control how much of the total runtime the GC accounts for. To do that we keep a running average of the GC time and the total time since the GC was last called, and if the ratio between them is bigger than a specific threshold (I suppose this can be configurable), then we hold the GC back from running. The idea is that if the GC is running too much, we hold it off. I think configuring this threshold may be much easier than changing the knobs of the generation thresholds, because it is much easier to understand what it is achieving. I think this is not going to be as exciting as the previous ideas, and the system call for the time has the potential to slow down some use cases unless we combine it with other approaches.
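A rough sketch of the time-barrier idea (not the code in GH-100421; the budget value and the exponential smoothing are assumptions made for illustration):

```python
import time

GC_TIME_BUDGET = 0.02   # assumed knob: let the GC use at most ~2% of runtime
EMA_ALPHA = 0.2         # assumed smoothing factor for the running average

class TimeBarrier:
    """Hold automatic collections back while GC time dominates runtime."""

    def __init__(self) -> None:
        self.avg_gc_time = 0.0
        self.last_gc_end = time.perf_counter()

    def should_collect(self) -> bool:
        elapsed = time.perf_counter() - self.last_gc_end
        if elapsed <= 0.0:
            return False
        # Skip the collection if the (smoothed) GC time is too large a
        # fraction of the time that has passed since the last collection.
        return self.avg_gc_time / elapsed <= GC_TIME_BUDGET

    def record_collection(self, gc_duration: float) -> None:
        self.avg_gc_time = EMA_ALPHA * gc_duration + (1 - EMA_ALPHA) * self.avg_gc_time
        self.last_gc_end = time.perf_counter()
```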
And here is yet another idea: #100422. This implements dynamic thresholds based on how successful a specific collection was. When we do a collection, we calculate how many objects have been collected divided by how many objects were initially considered. Then we update the threshold for that generation based on the difference between the current ratio of collected to total objects and a given configurable target ratio. The update uses a sigmoid that saturates between 0 and 1 and peaks over given values of the ratio (a rough sketch of this kind of update appears after the benchmark results below). This has the advantage over the first method that it keeps the generations. Benchmarks for this idea: All benchmarks:
This is how it compares to the first method: All benchmarks:
Benchmark hidden because not significant (57): json_dumps, pathlib, django_template, fannkuch, pickle_list, sqlglot_normalize, scimark_fft, spectral_norm, pickle_dict, scimark_lu, chameleon, chaos, scimark_monte_carlo, telco, regex_compile, pprint_safe_repr, crypto_pyaes, regex_v8, raytrace, sqlglot_optimize, go, pprint_pformat, hexiom, xml_etree_process, bench_thread_pool, scimark_sparse_mat_mult, pickle, pidigits, sqlite_synth, scimark_sor, unpickle, sqlglot_parse, mdp, sympy_str, html5lib, nqueens, genshi_xml, json_loads, sympy_sum, sqlglot_transpile, regex_effbot, unpack_sequence, bench_mp_pool, sympy_integrate, deepcopy_reduce, logging_simple, sympy_expand, coverage, richards, nbody, pyflate, xml_etree_generate, logging_format, deepcopy_memo, unpickle_pure_python, generators, deepcopy
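As mentioned above, here is a rough sketch of the kind of sigmoid-based update described for GH-100422; the target ratio, bounds, and steepness are made up for illustration and are not the PR's actual constants:

```python
import math

TARGET_RATIO = 0.20   # assumed target fraction of collected objects

def update_threshold(threshold: int, collected: int, examined: int,
                     lo: float = 0.5, hi: float = 2.0, steepness: float = 10.0) -> int:
    if examined == 0:
        return threshold
    ratio = collected / examined
    # Sigmoid in (0, 1): ~0.5 when the collection hits the target,
    # saturating at 1 for very inefficient and 0 for very efficient runs.
    s = 1.0 / (1.0 + math.exp(steepness * (ratio - TARGET_RATIO)))
    # Map it to a bounded multiplier: inefficient collections push the
    # threshold toward hi (collect less often), efficient ones toward lo.
    return int(threshold * (lo + (hi - lo) * s))
```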
I suspect it really depends on the application. It could be the case that some (many?) applications could completely turn off the cyclic GC and rely only on ref counting. However, I imagine there are applications that generate cyclic garbage quickly and would perform quite badly with a single-generation GC. It takes quite a lot more time to do the full collection, and the young generation is effective at collecting recently created cyclic garbage. The dynamic threshold sounds like a good idea to me. The challenge will be to make something that works for all CPython users. We already have something that reduces the number of full collections, from gcmodule.c:
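The quoted snippet did not survive the thread; paraphrasing it as a Python-level sketch rather than reproducing the actual C, the rule in gcmodule.c is roughly:

```python
def should_do_full_collection(long_lived_pending: int, long_lived_total: int) -> bool:
    # Only collect the oldest generation when the objects added to it since
    # the last full collection exceed 25% of the objects that survived it.
    return long_lived_pending > long_lived_total // 4
```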
What do you think of the dynamic thresholds approach? I think it could be interesting because it is easy to understand how it changes the thresholds, and it is also easy to understand what the knobs are trying to achieve in the GC behaviour. It also had good performance numbers in the benchmark suite.
@nanjekyejoannah as well (who is also interested in GC if I recall correctly?)
Benchmarking real applications is good, but I think we should also run some simulations or do some analysis to estimate how the GC will behave under different scenarios. To that end, I started writing a simulator. It is still crude and likely buggy.

To help with simulation and modelling, some profiling would help to determine the cost of minor and major GC runs. My first guess is that the cost is roughly O(n) where n is the number of objects in the generation. If that's correct, I don't see how the dynamic threshold change will help, at least as written. Each newly created object gets examined once in the young generation, so doing it less often doesn't save any time; it just makes the gen 0 collection take longer.

That's another thing to be concerned about: GC pause time. Making gen 0 and gen 1 large will increase GC pause time. It is a "stop the world" GC and some applications might not be happy with much longer pauses.

In my basic simulator, I set the fraction of objects that are trash. If that's roughly how things work, I don't think the threshold algorithm in gh-100422 is what we want. Say 2% of newly created objects immediately become cyclic trash. Since it is targeting 20% of the objects in gen 0 as trash, it will keep increasing the threshold until it reaches the upper limit. Making the generation larger does not increase the % of objects that are trash, and I don't see how that's likely in any scenario. Perhaps the threshold needs to be based on something like (number of trash objects found / number of total live objects). However, that approach would result in GC pauses becoming long as the total number of live objects gets large.

BTW, it looks to me like gh-100422 sets thresholds for older generations so high that collections basically don't happen. Those thresholds are based on the number of collections, not the number of objects added to the generation.

Some ideas on how to make progress:
Regarding your "Statistics" tables, those are interesting, but it's not exactly clear to me what the numbers mean. Is the "count" directly from the GC generation count field? If so, I think the gen 0 count is based on the number of objects in the generation, whereas gen 1 and gen 2 are based on the number of collections. And I guess the rest of the table rows are based on the number of trash objects found. If that's all correct, gen 0 is not as bad as you think at finding trash, whereas gen 1 is pretty useless. Maybe we should only have two generations. It would be interesting to see statistics based on the fraction of objects found to be trash and the number of objects in each generation. Also, the amount of time for the GC vs the size of the generation. We might deduce the O(...) behaviour from that.
But under that assumption, if you make simulations that have a constant percentage of trash per collection, the size of the generation doesn't matter, because the only thing that determines the time in GC over a fixed period is the total number of objects the GC has seen, and unless you add more assumptions, that should not change. On the other hand, I think the key here is that having fewer GC collections allows refcounting to kill most of the stuff that's not cyclic trash, and therefore the collections become more efficient. I think this is the key aspect that can optimize GC time in general, and you won't be able to capture it with a simulation that uses fixed percentages.
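To make the disagreement concrete, here is a minimal, hypothetical sketch (not the simulator mentioned above) of a fixed-trash-fraction model with an O(n) collection cost; under those assumptions the total gen0 cost ends up depending on how many objects pass through the collector, not on the threshold:

```python
def simulate_gen0(total_allocations: int, threshold: int,
                  trash_fraction: float, cost_per_object: float = 1.0):
    """Toy model: every allocation is GC-tracked, collections cost O(n),
    and a fixed fraction of each generation is cyclic trash."""
    gen0 = 0
    gc_cost = 0.0
    collected = promoted = 0
    for _ in range(total_allocations):
        gen0 += 1
        if gen0 >= threshold:
            gc_cost += gen0 * cost_per_object   # O(n) pass over the generation
            trash = int(gen0 * trash_fraction)  # constant fraction is trash
            collected += trash
            promoted += gen0 - trash            # survivors move to the next generation
            gen0 = 0
    return gc_cost, collected, promoted

# Under this model the gen0 cost is ~total_allocations for any threshold.
for threshold in (700, 7000, 70000):
    print(threshold, simulate_gen0(1_000_000, threshold, trash_fraction=0.02))
```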
Apologies for the lack of clarity on the tables. "Count" is just the number of data points, but that value alone doesn't tell you much, so I would suggest ignoring it. The rest are the % of collected objects over the number of objects available at the start of a collection of that generation. As this is done across many different runs, you get different statistics. For example, mean = 1.192775 means that on average a collection of this generation collects about 1% of the objects in the generation, with a standard deviation of 3.5%. The 25/75 rows are the quantiles of the distribution, and min and max are the efficiency of the worst and best collection. In this case the worst collection collected 0% of the objects and the best collection collected 86% of the objects.
Yeah, I suspect that most of the improvements come from making the generations bigger. That's another question: maybe the generations we currently have are too small for many applications. Maybe just changing the numbers would help a bit. Coming up with 'good numbers' is, I suppose, going to be the most challenging part, but I cannot see any proof that the status quo is maximizing any specific objective. Does anyone know where the numbers came from?
Yeah, although the idea is that over the long run, the actual total time of GC should be less. I assume these are two different axes to optimize, depending on your application.
I may be missing something, but those are the statistics that I showed: it is the % of collected trash objects over the objects in that generation.
Ah, that's a good point and would explain how your change could improve things. It would be interesting to gather some data about that, i.e. when an object
Oh, I totally misunderstood then. That changes things a lot. So gen 0 and 1 are finding hardly any trash, since the 75th percentile is so low. The gen 2 numbers are surprising too: the 50th percentile is 54.6%? That seems high.
Originally I just picked some thresholds without too much thought. There was
Right, I misunderstood. Those numbers are useful and surprising, at least to me.
Oh, fantastic suggestion. Let me play with this a bit and I will maybe propose a PR.
Let me try to gather them again and point you to a branch to reproduce them so we can play with different workloads. Maybe we could also add a configure flag or something to have this in the interpreter itself.
If we develop a statistics-gathering patch that's been well-tested and we're confident it collects useful data, assuming I can reasonably backport it to Python 3.8 or 3.10, I should be able to run it on an IG prod host and collect stats from our workload.
Port of Pablo's pythonGH-100403 to Python 3.11.x.
I made a small change to allow the threshold to be set by an env var:
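The actual change isn't shown in the thread; as a Python-level sketch of the same idea, with a hypothetical variable name:

```python
import gc
import os

# PYTHONGCTHRESHOLD is a hypothetical variable name used only for this sketch.
value = os.environ.get("PYTHONGCTHRESHOLD")
if value is not None:
    _, gen1, gen2 = gc.get_threshold()
    gc.set_threshold(int(value), gen1, gen2)   # override only the gen0 threshold
```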
I gathered stats for a web application and for the Django test suite, at 700 and 7000 for the GC threshold. I added a new chart, "Collected per run time", which is "objects collected" / "running time", e.g. objects collected per ms of collection time. Using a threshold of 7000 makes the youngest collection much more effective: >600 objects collected per ms of runtime vs <200.

Some investigation shows that the HTML form framework used by the web app creates some reference cycles, so each web page generated that contains HTML forms can create work for the GC. Black creates some cyclic garbage too (NFAState, LineGenerator, Leaf, Node) but not very much. Unlike the web app, Django doesn't benefit as much from using 7000 as the threshold. Total GC run time goes down, which is good. I would like to do more testing and analysis.

My initial impression is that 700 is too small for the threshold, especially for apps that don't create a lot of cyclic trash. Also, it seems we do a bunch of first-gen collections before the interpreter finishes initializing. That seems like some low-hanging fruit to claim, e.g. disable automatic GC until startup is done. We are not likely creating a lot of cyclic garbage there.

Again, I think we need to think more about what our objectives are. Some possible ones: minimum runtime (i.e. max throughput), minimize GC pauses (min latency), minimize memory usage (minimize the time cyclic trash remains alive). With the threshold at 700, first-generation collections seem to take a fraction of 1 ms, small enough that I doubt anyone notices them. If we increase the size so it takes 10 ms, people may notice. E.g. 100 FPS is 10 ms per frame.
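Regarding the "disable automatic GC until startup is done" idea: the comment is about the interpreter's own startup, but applications can already approximate it today, e.g. (a sketch, with stand-in imports):

```python
import gc

gc.disable()                        # skip automatic young-gen passes during startup
import json, sqlite3, http.server   # stand-ins for an application's heavy imports
gc.collect()                        # one explicit pass to clean up startup cycles
gc.freeze()                         # park the survivors in the permanent generation
gc.enable()
```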
Some more testing, trying different thresholds. Script to generate:
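The script itself isn't included; here is a rough, hypothetical sketch of one way to time collections at a given threshold using gc.callbacks:

```python
import gc
import sys
import time

timings = {0: [], 1: [], 2: []}
_start = [0.0]

def _gc_callback(phase, info):
    # Called by the interpreter around every automatic collection.
    if phase == "start":
        _start[0] = time.perf_counter()
    else:
        timings[info["generation"]].append(time.perf_counter() - _start[0])

def make_cycles(n):
    # Create cyclic garbage so the collector has some real work to do.
    for _ in range(n):
        a, b = [], []
        a.append(b)
        b.append(a)

if __name__ == "__main__":
    threshold = int(sys.argv[1]) if len(sys.argv) > 1 else 700
    gc.set_threshold(threshold)
    gc.callbacks.append(_gc_callback)
    make_cycles(500_000)
    for gen, times in timings.items():
        if times:
            print(f"gen{gen}: {len(times)} runs, "
                  f"mean {sum(times) / len(times) * 1000:.3f} ms")
```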
I'm not sure why the "time per total" number goes up when the threshold is higher. I would expect some efficiency gain with a larger number of objects. For this app, I would say 5000 is near the sweet spot. With a threshold of 50,000, the mean time to collect gen0 is 17 ms vs 0.9 ms, and "time per collected object" is only slightly better.
Some more stats based on this Super Mario simulator built on pygame: https://github.com/mx0c/super-mario-python

Base
1st gen @ 7000

Generation 0: 803 objects collected, 0.01 s, 74589 total objects

It seems that this application is largely unaffected by the change, modulo some outliers in the first gen.
A few more thoughts about this. The 700 threshold value was set long ago. The commit that set the value:
At that time, only a few "container objects" had GC support (e.g. lists, tuples, dicts). Many more types have since gained support for cyclic GC. That makes collections happen more often than they originally did. Also, computer memory has expanded significantly in 20 years. So, it seems likely 700 is too small as a default.

The fact that threshold1 and threshold2 are based on #-of-collections rather than #-of-objects makes tuning more complex. It would be nice to base them on objects as well, but when an object is freed, we don't know which generation it is in. We always subtract one from the gen0 count.

Regarding the dynamic tuning of the threshold, what are the reasons to not just always use the high threshold limit? You could argue that you are using more memory that could be freed if you collected more often. However, if you are creating a lot of cyclic garbage, the threshold will be reached sooner, and maybe a high threshold is better anyway.

There is a danger with a high threshold that a lot of memory can be used by cyclic trash that is not accounted for by tracking #-of-objects. E.g. if you have some large array objects, a higher GC threshold will make your program use much more memory. It would be nice if we could track total memory use rather than #-of-objects as a threshold. I tried adding this function:
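That function isn't reproduced above; as a rough Python-level illustration of the idea, using sys.getallocatedblocks() as a crude proxy for allocator growth and a made-up growth factor:

```python
import sys

MEMORY_GROWTH_FACTOR = 1.5   # assumed knob: require ~50% allocator growth

class MemoryAwareTrigger:
    """Only allow a collection when both the object-count threshold and a
    memory-growth condition are met."""

    def __init__(self) -> None:
        self.baseline_blocks = sys.getallocatedblocks()

    def should_collect(self, pending_objects: int, object_threshold: int) -> bool:
        grew = sys.getallocatedblocks() > self.baseline_blocks * MEMORY_GROWTH_FACTOR
        return pending_objects >= object_threshold and grew

    def after_collection(self) -> None:
        self.baseline_blocks = sys.getallocatedblocks()
```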
And then changing GC triggering to check both the #-of-objects threshold and also the incremental size of allocated memory. Even if we can track memory used by PyMem/PyObject_Malloc, it is still possible that extensions allocate memory using their own malloc and we wouldn't see that. So, maybe using an OS API to find the process memory size is better. On platforms with OS support, we could base GC collection on both #-of-objects and the increase in process memory size. On other platforms, use just #-of-objects.
The idea was that we don't want to impact latency too much if there are a lot of cyclic objects. If we just keep growing the threshold and there are A LOT of cycles, then the latency will grow unbounded. In those cases we want to back off a bit and do more small collections. Maybe there are better metrics to drive the "dynamic" part than the one I am using there, though.
This is interesting but still has some challenges. One of the problems is that unless we only look at pools that are assigned to GC objects, we can make huge mistakes when estimating size. The OS APIs report resident size, but that can be very deceiving for many reasons, and it also has the problem that it's very difficult to transform into a number representing something the GC can actually do anything about. For example, I can imagine a situation where someone mmaped a huge chunk of memory (or some C extension did) and that causes the GC to run in crazy mode because it thinks it is able to clean some of that memory when maybe that is not possible.

The same problem happens with arenas or the PyMem/PyObject_Malloc APIs: we should only account for the memory of GC objects, because someone could have a gigantic string and then the GC would run in crazy mode. It seems that mimalloc could help us if we force everyone to use the Python memory APIs and we segment objects into GC and non-GC, but that has a lot of other problems. Another approach is running
I don't mean to increase the threshold without bound, so the latency shouldn't be unbounded. My idea was to just set the threshold to the high value and keep it there. If a higher threshold works when there are not a lot of trash cycles, why doesn't it also work when there are many cycles?
I was thinking of using something like:
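The snippet is missing from the thread; given the reply below about ru_maxrss, it presumably looked something like this sketch, written here with the resource module (Unix-only) rather than the C API:

```python
import resource

def peak_rss_kb() -> int:
    # ru_maxrss is the peak resident set size: kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
```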
That sounds way too expensive to me. You could walk obmalloc pools and add up what's used that way (you know the size class for each pool). Then have a way of tracking the sizes of large objects. Still sounds too slow to me.
But ru_maxrss will spike for a lot of unrelated reasons, no? Opening files, dlopening shared objects, shared libraries allocating big memory pools... etc.
... other non-Python threads doing useful things... etc.
Fundamentally, there cannot be a fixed threshold that is perfect. Every application has a different data-structure shape, and any static value is a compromise no matter when it is chosen or what it is set to. That doesn't mean changing it is a bad idea. Similarly, any dynamic value is probably prone to an antagonistic load that leads to it doing worse in some situations. Again, compromise. This is why we have tunable thresholds.

I suggest focusing first on exposing more detailed metrics from the GC than our existing get_stats() and get_count() APIs offer.

For example: is there a way to have --enable-pystats-like data, and more, always enabled without a notable performance hit? Or at least something that can be enabled by an API call or environment variable rather than needing to recompile with a configure #define?
I made a quick-and-dirty branch to test my idea of also checking for an increase in process memory use. I decided that the simplest thing to do with

Branch:

A re-run of my web app benchmark, similar to the table in my comment above. Notice that for the 500-object threshold (first row), the average size of gen0 is 5,206. That means the gen0 collection is being skipped about 9 out of 10 times because the memory estimate returned by

An idea would be to add
I think so. Some stuff that Pablo's stats branch collects might be too expensive. However, I think we could collect/report some more useful things without a lot of extra overhead. Then we can have a |
I can modify my PR to do that in addition to the pystats collection. What things do you think would be useful to have in the regular builds that would not be too prohibitive?
I have modified #100958 to be activated via a
In the pursuit of trying to optimize GC runs, it has been observed that the weak generational hypothesis may not apply that well to Python. This is because, according to this argument, in the presence of a mixed cycle GC + refcount GC strategy, young objects are mostly cleaned up by reference counting, not by the cycle GC. It is also important that there is no segregation between the two GC strategies, so the cycle GC needs to deal with objects that, according to this argument, will mainly be cleaned up by reference counting alone.
This questions the utility of segregating the GC by generations, and indeed there is some evidence of this. I have been benchmarking the percentage of success of different generations in some programs (such as black and mypy and a bunch of HTTP servers), and the success rate of the lower generations is generally small. Here is an example of running black over all of the standard library:

Statistics for generation 0
Statistics for generation 1
Statistics for generation 2
I am currently investigating whether having a single generation with a dynamic threshold, similar to the strategy we currently use for the last generation, would generally give better performance.
What do you think?