
bpo-37448: Add radix tree implementation for obmalloc address_in_range(). #14474

Merged
merged 17 commits
Mar 30, 2021

Conversation

nascheme
Member

@nascheme nascheme commented Jun 29, 2019

The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.

https://bugs.python.org/issue37448

@tim-one
Member

tim-one commented Jul 1, 2019

Food for thought: consider making L1 fatter? My understanding is that Intel and AMD chips are limited to at most 48 physical address bits now, and they agreed that, for now, the top 16 bits of virtual addresses must be copies of bit 2**47. Which was intentional, to stop "clever" OS authors and users from using the higher-order bits for something else, thus making future expansion of physical address bits essentially impossible. Jerks - we could have used the sign bit to mean "address in range" 😉.

So the top 17 bits will all be 0 or all be 1, and any L1 width <= 17 will never see more than those 2 values. On Windows, I expect they'll only see "all 0", because the OS reserves the "all 1" patterns for its own use - don't know about Linux.

I bet the next step will be 56-bit physical addresses.

@tim-one
Member

tim-one commented Jul 2, 2019

Note that on a 32-bit box, with 256K arenas there are only 2**(32-18) = 2**14 = 16K ideal arena addresses. So a static 1-level tree would be fine.

@nascheme
Member Author

nascheme commented Jul 3, 2019

Regarding making the root node larger, that sounds like a good idea. I'm not sure how large is good though. It increases the virtual set size of the process. I compared VSS and RSS of the base Python (current address_in_range()), small root radix node (MAP1_BITS = 16) and MAP1_BITS = 20, using a program that does basically nothing.

Program    RSS [MiB]   VSS [MiB]
Base       11.4        12.9
16 bits    11.4        14.0
20 bits    11.4        21.2

Below is my quick and dirty change to use 20 bits for the root node.

I'm hesitant to increase the VSS. On Linux it is not much of a problem because Linux typically is very willing to overcommit on RAM. However, other operating systems might be different. E.g. I think FreeBSD is more conservative. Not sure about Windows. The idea is to avoid having to kill programs due to an out-of-memory situation. If they don't overcommit, they will refuse to start new programs if the VSS is too large. At least, that's my understanding.

--- a/Objects/obmalloc.c
+++ b/Objects/obmalloc.c
@@ -2831,20 +2831,27 @@ _PyObject_DebugMallocStats(FILE *out)
 #   error "arena size must be < 2^32"
 #endif
 
-/* bits used for MAP1 and MAP2 nodes */
-#define INTERIOR_BITS ((BITS - ARENA_BITS + 2) / 3)
-
-#define MAP1_BITS INTERIOR_BITS
+/* Current 64-bit processors are limited to 48-bit physical addresses.  For
+ * now, the top 17 bits of addresses will all be equal to bit 2**47.  That is
+ * one reason to make MAP1_LENGTH large.  Another reason is that we initialize
+ * the root node in the BSS.  For modern operating systems with virtual memory,
+ * it will only consume actual RAM for the OS pages that are touched.
+ */
+#define MAP1_BITS 20
 #define MAP1_LENGTH (1 << MAP1_BITS)
 
-#define MAP2_BITS INTERIOR_BITS
+#define MAP2_BITS ((BITS - MAP1_BITS - ARENA_BITS)/2)
 #define MAP2_LENGTH (1 << MAP2_BITS)
 #define MAP2_MASK (MAP2_LENGTH - 1)
 
-#define MAP3_BITS (BITS - ARENA_BITS - 2*INTERIOR_BITS)
+#define MAP3_BITS (BITS - MAP1_BITS - MAP2_BITS - ARENA_BITS)
 #define MAP3_LENGTH (1 << MAP3_BITS)
 #define MAP3_MASK (MAP3_LENGTH - 1)
 
+#if MAP3_BITS < 8
+#   error "MAP3_BITS too small, adjust MAP1_BITS"
+#endif
+
 #define MAP3_SHIFT ARENA_BITS
 #define MAP2_SHIFT (MAP3_BITS + MAP3_SHIFT)
 #define MAP1_SHIFT (MAP2_BITS + MAP2_SHIFT)

@nascheme
Member Author

nascheme commented Jul 3, 2019

Regarding a 32-bit version, I'm not sure what to do there. We should at least do some benchmarking on common 32-bit platforms. I would guess most common would be 32-bit Windows on an amd64 processor or Linux on a 32-bit ARM processor. In those cases, maybe we could keep POOL_SIZE <= 4096 and keep the existing address_in_range()?

@tim-one
Member

tim-one commented Jul 3, 2019

I had in mind more widening L1 by narrowing it ;-) That is, throw away the top 16 address bits entirely (in debug mode verifying they're the same as the top bit that remains). Then we only have 48 - 18 = 30 arena prefix bits for the tree levels to resolve.

Yes, that's aggressive. But then, best I can tell, the entire useful effect of widening L1 to 20 bits as-is could be gotten by shrinking it to 4 bits instead. Of course the tree would have to be reworked a bit when physical address bits increase. Keep it 3 levels, and perhaps it could just be a matter of changing a PHYSICAL_ADDRESS_BITS #define.

FYI, there is no overcommit on Windows, and no OOM killer. While doing a "commit" with VirtualAlloc doesn't assign physical RAM (same as Linux, that's assigned as pages are later touched), the amount is charged against a system-wide total of physical RAM + swapfile size. The commit fails if that total would be exceeded.

So increasing (the moral equivalent of) VSS there is even less attractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink.

@nascheme
Member Author

nascheme commented Jul 3, 2019

So increasing (the moral equivalent of) VSS there is even less attractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink.

Ah, that is very pragmatic. Thank you for explaining it to me. ;-P

I have made a patch and done some testing. Here is an updated table. Obviously the radix tree is tiny when it only has to cover 30 bits. I have the root node using 10 bits.

Program    RSS [MiB]   VSS [MiB]
Base       11.4        12.9
16 bits    11.4        14.0
20 bits    11.4        21.2
10 bits    11.2        12.9

@tim-one
Member

tim-one commented Jul 4, 2019

Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are.

FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56:

https://en.wikipedia.org/wiki/Intel_5-level_paging

And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet.

Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if PY38OBMALLOC is defined).

@nascheme
Member Author

nascheme commented Jul 4, 2019

Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are.

I found this recent bug fix:

powerpc/mm/64s/hash: Reallocate context ids on fork

I think that means with certain setups, user mode programs on POWER9 processors can see addresses in the 4 PB range (52 bits?).

FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56:

https://en.wikipedia.org/wiki/Intel_5-level_paging

And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet.

Yes, the patch is in the current release:
https://www.kernel.org/doc/html/latest/x86/x86_64/5level-paging.html

You will be amused to read that you are not the first to think of hiding flags in the high bits of pointers.

Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if PY38OBMALLOC is defined).

There is a define for that: WITH_RADIX_TREE. Easy to turn off.

@brettcannon brettcannon added performance Performance or resource usage type-feature A feature request or enhancement labels Jul 5, 2019
The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.
Current 64-bit chips only have 48 bits of physical addresses.
Exploit that to further shrink the size of the radix tree.
@nascheme
Member Author

The branch has been updated to add 32-bit support. For 32-bit pointers, two levels of the tree have been eliminated vs the 64-bit version.

Now the radix tree is enabled unconditionally for all CPython platforms. If desired, we can restore the WITH_RADIX_TREE conditional and leave the old address_in_range() logic in there. However, I think it's cleaner just to remove it. The only reason to keep it is if there is some platform on which the radix tree doesn't work, or on which it is (materially) slower than the old version.

I went ahead and also increased the arena and pool sizes by 4x. That's based on advice from Tim Peters. He says:

I'm all in favor of merging this if it included, say, quadrupling
arena and pool sizes. Then the radix tree is a much better technical
approach than the "big pool" hack I coded.

Tim's big-pool patch is here: https://bugs.python.org/issue37211

I've run pyperformance benchmarks. Without increasing pool and arena sizes, the radix version of obmalloc is 1.01x or 1.02x slower on some of the benchmarks. With the size increases, it is about that much faster. So, I would say basically a wash.

If disabled, the old version of address_in_range() is used.
@methane
Member

methane commented Feb 25, 2021

It's a bit more code to leave the old version in but not too much. I can do that. Should I also add a "configure" command line option? E.g. --disable-obmalloc-radix-tree.

A configure option is not required. If some Python user finds a memory usage regression, they can build Python with make -D PYMALLOC_RADIXTREE=0.

@gvanrossum
Member

@nascheme So what's stopping us from landing this?

@nascheme
Member Author

nascheme commented Mar 3, 2021

Only my own caution, I think. It is easy to back out or disable if we decide it's not good. I have a small cleanup to variable names to make but I can merge after that.

@gvanrossum
Member

Should probably get a review from @pablogsal (Pablo, you can also delegate to someone else).

@eduardo-elizondo
Contributor

Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e. no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change.

@pablogsal
Member

Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e. no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change.

Thanks a lot @eduardo-elizondo, that will help us a lot when making a review and a final decision. Seems that the main thing we need to consider is maintenance cost and possible semantic change, but I am optimistic about both. I still need to find time to review it in detail.

@pablogsal
Member

One question: does valgrind complain the same way with this change as it used to with the old address_in_range()? Is the situation better with the radix tree?

@tim-one
Member

tim-one commented Mar 11, 2021

Pablo, I never ran valgrind myself, but a primary point of this different approach is that it never references uninitialized memory (well, assuming users don't pass insane addresses to free()/realloc()). The current approach does, which is why valgrind legitimately (but spuriously) complains. valgrind "should be" 100% happy with what the PR does instead.

@tim-one tim-one closed this Mar 11, 2021
@tim-one tim-one reopened this Mar 11, 2021
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 11, 2021
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b41a77a 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 11, 2021
@nascheme nascheme merged commit 85b6b70 into python:master Mar 30, 2021
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL8 3.x has failed when building commit 85b6b70.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/509/builds/926) and take a look at the build logs.
  4. Check if the failure is related to this commit (85b6b70) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/509/builds/926

Summary of the results of the build (if available):

== Tests result: ENV CHANGED ==

412 tests OK.

10 slowest tests:

  • test_concurrent_futures: 3 min 5 sec
  • test_gdb: 1 min 24 sec
  • test_multiprocessing_spawn: 1 min 24 sec
  • test_peg_generator: 1 min 17 sec
  • test_capi: 1 min 14 sec
  • test_unparse: 1 min 9 sec
  • test_multiprocessing_forkserver: 1 min 7 sec
  • test_multiprocessing_fork: 1 min 2 sec
  • test_tokenize: 56.0 sec
  • test_asyncio: 54.1 sec

1 test altered the execution environment:
test_asyncio

14 tests skipped:
test_devpoll test_ioctl test_kqueue test_msilib test_nis
test_ossaudiodev test_startfile test_tix test_tk test_ttk_guionly
test_winconsoleio test_winreg test_winsound test_zipfile64

Total duration: 5 min 35 sec

Click to see traceback logs
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 321, in __del__
    self.close()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 316, in close
    self._ssl_protocol._start_shutdown()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 590, in _start_shutdown
    self._abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 731, in _abort
    self._transport.abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 680, in abort
    self._force_close(None)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 731, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 745, in call_soon
    self._check_closed()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

@nascheme
Member Author

Buildbot failure on S390x seems to be unrelated to this change. See https://bugs.python.org/issue37368 .

/* number of bits in a pointer */
#define POINTER_BITS 64

/* Current 64-bit processors are limited to 48-bit physical addresses. For
Contributor


You're talking about virtual addresses here; Python does not see physical addresses unless run bare metal.

And the 48-bit claim isn't true in general; you're only thinking of x86, and even then LA57 support has been floating around for half a decade now. AArch64, depending on hardware support and configuration, can go up to 52-bit addresses. Some 64-bit SPARC machines have full 64-bit virtual address space support (though the 4-level Linux kernel implementation can only use up to 53). Itanium is complicated; it seems to use up to 44 of the low bits, at least on this specific machine, but the top three bits are also significant, as they are a region number within which the other bits are interpreted (i.e. you get 8 separate address spaces you can configure differently).

Contributor


(RISC-V also has only Sv39 and Sv48 currently but Sv57 is on track to be ratified very soon)

Member Author


It sounds like we need to make the comment more accurate and add some extra ifdefs for these machines. The easiest fix is to set PYMALLOC_RADIXTREE=0 for those machines.

Contributor


Or perhaps better to have the throwing-bits-away version opt-in rather than opt-out, so some architecture you've never heard of, or doesn't yet exist, works, albeit perhaps a little less efficiently, rather than breaking?


#define MAP_TOP_BITS INTERIOR_BITS
#define MAP_TOP_LENGTH (1 << MAP_TOP_BITS)
#define MAP_TOP_MASK (MAP_BOT_LENGTH - 1)
Member


MAP_TOP_LENGTH?

Member Author


Yup, it's a bug; it should be TOP, not BOT. On 64-bit machines, MAP_BOT_LENGTH is 256 and MAP_TOP_LENGTH is 1024.

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Jan 5, 2024
…n_range(). (pythonGH-14474)

This is a backport of the radix tree implementation for pymalloc.
It is not platform specific but it solves a segmentation fault on
aarch64 platforms when MTE (Memory Tag Extension) is enabled.
The github issue is pythongh-87759.

Original commit message:
The radix tree approach is a relatively simple and memory sanitary
alternative to the old (slightly) unsanitary address_in_range().
To disable the radix tree map, set a preprocessor flag as follows:
-DWITH_PYMALLOC_RADIX_TREE=0.

(cherry picked from commit 85b6b70)

Co-authored-by: Tim Peters <tim.peters@gmail.com>
Jira: ENTLLT-7122
Change-Id: Ifbc8c1290077b78c85ac30abe0bcbb7c8ea0b959
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Jan 5, 2024
…n_range(). (pythonGH-14474)

This is a backport of the radix tree implementation for pymalloc.
It is not platform specific but it solves a segmentation fault on
aarch64 platforms when MTE (Memory Tag Extension) is enabled.
The github issue is pythongh-87759.

Original commit message:
The radix tree approach is a relatively simple and memory sanitary
alternative to the old (slightly) unsanitary address_in_range().
To disable the radix tree map, set a preprocessor flag as follows:
-DWITH_PYMALLOC_RADIX_TREE=0.

(cherry picked from commit 85b6b70)

Co-authored-by: Tim Peters <tim.peters@gmail.com>
Change-Id: I0a3c2979c207f997c707c5f798941426c8d50004
@bedevere-app

bedevere-app bot commented Jan 5, 2024

GH-113737 is a backport of this pull request to the 3.9 branch.

