-
-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-37448: Add radix tree implementation for obmalloc address_in_range(). #14474
Conversation
Food for thought: consider making L1 fatter? My understanding is that Intel and AMD chips are limited to at most 48 physical address bits now, and they agreed that, for now, the top 16 bits of virtual addresses must be copies of bit 2**47. Which was intentional, to stop "clever" OS authors and users from using the higher-order bits for something else, thus making future expansion of physical address bits essentially impossible. Jerks - we could have used the sign bit to mean "address in range" 😉. So the top 17 bits will all be 0 or all be 1, and any L1 width <= 17 will never see more than those 2 values. On Windows, I expect they'll only see "all 0", because the OS reserves the "all 1" patterns for its own use - don't know about Linux. I bet the next step will be 56-bit physical addresses. |
Note that on a 32-bit box, with 256K arenas there are only 2**(32-18) = 2**14 = 16K ideal arena addresses. So a static 1-level tree would be fine. |
Regarding making the root node larger, that sounds like a good idea. I'm not sure how large is good though. It increases the virtual set size of the process. I compared VSS and RSS of the base Python (current address_in_range()), small root radix node (MAP1_BITS = 16) and MAP1_BITS = 20, using a program that does basically nothing.
Below is my quick and dirty change to use 20 bits for the root node. I'm hesitant to increase the VSS. On Linux it is not much of a problem because Linux typically is very willing to overcommit on RAM. However, other operating systems might be different. E.g. I think FreeBSD is more conservative. Not sure about Windows. The idea is to avoid having to kill programs due to an out-of-memory situation. If they don't overcommit, they will refuse to start new programs if the VSS is too large. At least, that's my understanding.
|
Regarding a 32-bit version, I'm not sure what to do there. We should at least do some benchmarking on common 32-bit platforms. I would guess most common would be 32-bit Windows on an amd64 processor or Linux on a 32-bit ARM processor. In those cases, maybe we could keep POOL_SIZE <= 4096 and keep the existing address_in_range()? |
I had in mind more widening L1 by narrowing it ;-) That is, throw away the top 16 address bits entirely (in debug mode verifying they're the same as the top bit that remains). Then we only have 48 - 18 = 30 arena prefix bits for the tree levels to resolve. Yes, that's aggressive. But then, best I can tell, the entire useful effect of widening L1 to 20 bits as-is could be gotten by shrinking it to 4 bits instead. Of course the tree would have to be reworked a bit when physical address bits increase. Keep it 3 levels, and perhaps it could just be a matter of changing a FYI, there is no overcommit on Windows, and no OOM killer. While doing a "commit" with So increasing (the moral equivalent of) VSS there is even less atractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink. |
Ah, that is very pragmatic. Thank you for explaining it to me. ;-P I have made a patch and done some testing. Here is an updated table. Obviously the radix tree is tiny in the case that it only uses 30 bits. I have root node using 10 bits.
|
Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are. FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56: https://en.wikipedia.org/wiki/Intel_5-level_paging And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet. Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if |
I found this recent bug fix: powerpc/mm/64s/hash: Reallocate context ids on fork I think that means with certain setups, user mode programs on POWER9 processors can see addresses in the 4 PB range (52 bits?).
Yes, the patch is in the current release: You will be amused to read that you are not the first to think of hiding flags in the high bits of pointers.
There is a define for that: WITH_RADIX_TREE. Easy to turn off. |
The radix tree approach is a relatively simple and memory sanitary alternative to the current (slightly) unsanitary address_in_range(). The radix tree is currently only implemented for 64-bit platforms. Adding a 32-bit version would be relatively easy.
Current 64-bit chips only have 48 bits of physical addresses. Exploit that to further shrink the size of the radix tree.
1e8297a
to
4df4225
Compare
The branch has been updated to add 32-bit support. For 32-bit pointers, two levels of the tree have been eliminated vs the 64-bit version. Now the radix tree is enabled unconditionally for all CPython platforms. If desired, we can restore the I went ahead an also increased the arena and pool sizes by 4x. That's based on advice from Tim Peters. He says:
Tim's big-pool patch is here: https://bugs.python.org/issue37211 I've run pyperformance benchmarks. Without increasing pool and arena sizes, the radix version of obmalloc is 1.01x or 1.02x slower on some of the benchmarks. With the size increases, it is about that much faster. So, I would say basically a wash. |
If disabled, the old version of address_in_range() is used.
Configure option is not required. If some Python user find memory usage regression, they can build Python with |
@nascheme So what's stopping us from landing this? |
Only my own caution, I think. It is easy to back out or disable if we decide it's not good. I have a small cleanup to variable names to make but I can merge after that. |
Should probably get a review from @pablogsal (Pablo, you can also delegate to someone else). |
Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change. |
Thanks a lot @eduardo-elizondo, that will help us a lot when making a review and a final decision. Seems that the main thing we need to consider is maintenance cost and possible semantic change, but I am optimistic about both. I still need to find time to review it in detail. |
One question: does valgrind complains the same way it used to with the old |
Pablo, I never ran valgrind myself, but a primary point of this different approach is that it never references uninitialized memory (well, assuming users don't pass insane addresses to |
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b41a77a 🤖 If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again. |
|
Buildbot failure on S390x seems to be unrelated to this change. See https://bugs.python.org/issue37368 . |
/* number of bits in a pointer */ | ||
#define POINTER_BITS 64 | ||
|
||
/* Current 64-bit processors are limited to 48-bit physical addresses. For |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're talking about virtual addresses here, Python does not see physical addresses unless run bare-metal.
But that's not true, you're only thinking of x86, but even then LA57 support is a thing that's been floating around for half a decade now. AArch64, depending on hardware support and configuration, can go up to 52-bit addresses. Some 64-bit SPARC machines have full 64-bit virtual address space support (though the 4-level Linux kernel implementation can only use up to 53). Itanium is complicated; it seems to use up to 44 of the low bits, at least on this specific machine, but the top three bits are also significant, as they are a region number, within which the other bits are interpreted (i.e. you get 8 separate address spaces you can configure differently).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(RISC-V also has only Sv39 and Sv48 currently but Sv57 is on track to be ratified very soon)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds like we need to make the comment more accurate and add some extra ifdefs for these machines. The easiest fix is to set PYMALLOC_RADIXTREE=0
for those machines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or perhaps better to have the throwing-bits-away version opt-in rather than opt-out, so some architecture you've never heard of, or doesn't yet exist, works, albeit perhaps a little less efficiently, rather than breaking?
|
||
#define MAP_TOP_BITS INTERIOR_BITS | ||
#define MAP_TOP_LENGTH (1 << MAP_TOP_BITS) | ||
#define MAP_TOP_MASK (MAP_BOT_LENGTH - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAP_TOP_LENGTH?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup it's a bug, it should be TOP not BOT. On 64-bit machines, MAP_BOT_LENGTH is 256 and MAP_TOP_LENGTH is 1024.
…n_range(). (pythonGH-14474) This is a backport of the radix tree implementation for pymalloc. It is not platform specific but it solves a segmentation fault on aarch64 platforms when MTE (Memory Tag Extension) is enabled. The github issue is pythongh-87759. Original commit message: The radix tree approach is a relatively simple and memory sanitary alternative to the old (slightly) unsanitary address_in_range(). To disable the radix tree map, set a preprocessor flag as follows: -DWITH_PYMALLOC_RADIX_TREE=0. (cherry picked from commit 85b6b70) Co-authored-by: Tim Peters <tim.peters@gmail.com> Jira: ENTLLT-7122 Change-Id: Ifbc8c1290077b78c85ac30abe0bcbb7c8ea0b959
…n_range(). (pythonGH-14474) This is a backport of the radix tree implementation for pymalloc. It is not platform specific but it solves a segmentation fault on aarch64 platforms when MTE (Memory Tag Extension) is enabled. The github issue is pythongh-87759. Original commit message: The radix tree approach is a relatively simple and memory sanitary alternative to the old (slightly) unsanitary address_in_range(). To disable the radix tree map, set a preprocessor flag as follows: -DWITH_PYMALLOC_RADIX_TREE=0. (cherry picked from commit 85b6b70) Co-authored-by: Tim Peters <tim.peters@gmail.com> Change-Id: I0a3c2979c207f997c707c5f798941426c8d50004
GH-113737 is a backport of this pull request to the 3.9 branch. |
1 similar comment
GH-113737 is a backport of this pull request to the 3.9 branch. |
The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.
https://bugs.python.org/issue37448