
bpo-37448: Add radix tree implementation for obmalloc address_in_range(). #14474

Merged
merged 17 commits
Mar 30, 2021

Conversation

nascheme
Member

@nascheme nascheme commented Jun 29, 2019

The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.

https://bugs.python.org/issue37448

@tim-one
Member

tim-one commented Jul 1, 2019

Food for thought: consider making L1 fatter? My understanding is that Intel and AMD chips are limited to at most 48 physical address bits now, and they agreed that, for now, the top 16 bits of virtual addresses must be copies of bit 2**47. Which was intentional, to stop "clever" OS authors and users from using the higher-order bits for something else, thus making future expansion of physical address bits essentially impossible. Jerks - we could have used the sign bit to mean "address in range" 😉.

So the top 17 bits will all be 0 or all be 1, and any L1 width <= 17 will never see more than those 2 values. On Windows, I expect they'll only see "all 0", because the OS reserves the "all 1" patterns for its own use - don't know about Linux.

I bet the next step will be 56-bit physical addresses.

@tim-one
Member

tim-one commented Jul 2, 2019

Note that on a 32-bit box, with 256K arenas there are only 2**(32-18) = 2**14 = 16K ideal arena addresses. So a static 1-level tree would be fine.

@nascheme
Member Author

nascheme commented Jul 3, 2019

Regarding making the root node larger, that sounds like a good idea. I'm not sure how large is good though. It increases the virtual set size of the process. I compared VSS and RSS of the base Python (current address_in_range()), small root radix node (MAP1_BITS = 16) and MAP1_BITS = 20, using a program that does basically nothing.

Program    RSS [MiB]   VSS [MiB]
Base       11.4        12.9
16 bits    11.4        14.0
20 bits    11.4        21.2

Below is my quick and dirty change to use 20 bits for the root node.

I'm hesitant to increase the VSS. On Linux it is not much of a problem because Linux typically is very willing to overcommit on RAM. However, other operating systems might be different. E.g. I think FreeBSD is more conservative. Not sure about Windows. The idea is to avoid having to kill programs due to an out-of-memory situation. If they don't overcommit, they will refuse to start new programs if the VSS is too large. At least, that's my understanding.

--- a/Objects/obmalloc.c
+++ b/Objects/obmalloc.c
@@ -2831,20 +2831,27 @@ _PyObject_DebugMallocStats(FILE *out)
 #   error "arena size must be < 2^32"
 #endif
 
-/* bits used for MAP1 and MAP2 nodes */
-#define INTERIOR_BITS ((BITS - ARENA_BITS + 2) / 3)
-
-#define MAP1_BITS INTERIOR_BITS
+/* Current 64-bit processors are limited to 48-bit physical addresses.  For
+ * now, the top 17 bits of addresses will all be equal to bit 2**47.  That is
+ * one reason to make MAP1_LENGTH large.  Another reason is that we initialize
+ * the root node in the BSS.  For modern operating systems with virtual memory,
+ * it will only consume actual RAM for the OS pages that are touched.
+ */
+#define MAP1_BITS 20
 #define MAP1_LENGTH (1 << MAP1_BITS)
 
-#define MAP2_BITS INTERIOR_BITS
+#define MAP2_BITS ((BITS - MAP1_BITS - ARENA_BITS)/2)
 #define MAP2_LENGTH (1 << MAP2_BITS)
 #define MAP2_MASK (MAP2_LENGTH - 1)
 
-#define MAP3_BITS (BITS - ARENA_BITS - 2*INTERIOR_BITS)
+#define MAP3_BITS (BITS - MAP1_BITS - MAP2_BITS - ARENA_BITS)
 #define MAP3_LENGTH (1 << MAP3_BITS)
 #define MAP3_MASK (MAP3_LENGTH - 1)
 
+#if MAP3_BITS < 8
+#   error "MAP3_BITS too small, adjust MAP1_BITS"
+#endif
+
 #define MAP3_SHIFT ARENA_BITS
 #define MAP2_SHIFT (MAP3_BITS + MAP3_SHIFT)
 #define MAP1_SHIFT (MAP2_BITS + MAP2_SHIFT)

@nascheme
Member Author

nascheme commented Jul 3, 2019

Regarding a 32-bit version, I'm not sure what to do there. We should at least do some benchmarking on common 32-bit platforms. I would guess most common would be 32-bit Windows on an amd64 processor or Linux on a 32-bit ARM processor. In those cases, maybe we could keep POOL_SIZE <= 4096 and keep the existing address_in_range()?

@tim-one
Member

tim-one commented Jul 3, 2019

I had in mind more widening L1 by narrowing it ;-) That is, throw away the top 16 address bits entirely (in debug mode verifying they're the same as the top bit that remains). Then we only have 48 - 18 = 30 arena prefix bits for the tree levels to resolve.

Yes, that's aggressive. But then, best I can tell, the entire useful effect of widening L1 to 20 bits as-is could be gotten by shrinking it to 4 bits instead. Of course the tree would have to be reworked a bit when physical address bits increase. Keep it 3 levels, and perhaps it could just be a matter of changing a PHYSICAL_ADDRESS_BITS #define.

FYI, there is no overcommit on Windows, and no OOM killer. While doing a "commit" with VirtualAlloc doesn't assign physical RAM (same as Linux, that's assigned as pages are later touched), the amount is charged against a system-wide total of physical RAM + swapfile size. The commit fails if that total would be exceeded.

So increasing (the moral equivalent of) VSS there is even less attractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink.

@nascheme
Member Author

nascheme commented Jul 3, 2019

So increasing (the moral equivalent of) VSS there is even less attractive. But there's something to be said for writing code that exploits hardware realities: there are only 48 address bits 😉. Then VSS can shrink.

Ah, that is very pragmatic. Thank you for explaining it to me. ;-P

I have made a patch and done some testing. Here is an updated table. Obviously the radix tree is tiny when it only has to cover 30 bits. I have the root node using 10 bits.

Program    RSS [MiB]   VSS [MiB]
Base       11.4        12.9
16 bits    11.4        14.0
20 bits    11.4        21.2
10 bits    11.2        12.9

@tim-one
Member

tim-one commented Jul 4, 2019

Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are.

FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56:

https://en.wikipedia.org/wiki/Intel_5-level_paging

And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet.

Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if PY38OBMALLOC is defined).

@nascheme
Member Author

nascheme commented Jul 4, 2019

Cool! OTOH ... we're making much stronger assumptions now, and it's unclear how long they'll last. Or even if they all apply even now. For example, I know that the 64-bit PowerPC architecture still lives on in some very high-end systems, but don't know whether Python is used on them, or what their virtual address conventions are.

I found this recent bug fix:

powerpc/mm/64s/hash: Reallocate context ids on fork

I think that means with certain setups, user mode programs on POWER9 processors can see addresses in the 4 PB range (52 bits?).

FYI, looks like "the next step" in x64-land will be 57-bit physical addresses rather than 56:

https://en.wikipedia.org/wiki/Intel_5-level_paging

And Linux has apparently already been patched for that, but in anticipation - no chips can use it yet.

Yes, the patch is in the current release:
https://www.kernel.org/doc/html/latest/x86/x86_64/5level-paging.html

You will be amused to read that you are not the first to think of hiding flags in the high bits of pointers.

Prudent: yes, leave 32-bit Python entirely alone. But also make it very easy to revert 64-bit Python to the status quo too (e.g., if PY38OBMALLOC is defined).

There is a define for that: WITH_RADIX_TREE. Easy to turn off.

@brettcannon brettcannon added performance Performance or resource usage type-feature A feature request or enhancement labels Jul 5, 2019
The radix tree approach is a relatively simple and memory sanitary
alternative to the current (slightly) unsanitary address_in_range().
The radix tree is currently only implemented for 64-bit platforms.
Adding a 32-bit version would be relatively easy.
Current 64-bit chips only have 48 bits of physical addresses.
Exploit that to further shrink the size of the radix tree.
@nascheme
Member Author

The branch has been updated to add 32-bit support. For 32-bit pointers, two levels of the tree have been eliminated vs the 64-bit version.

Now the radix tree is enabled unconditionally for all CPython platforms. If desired, we can restore the WITH_RADIX_TREE conditional and leave the old address_in_range() logic in there. However, I think it's cleaner just to remove it. The only reason to keep it is if there is some platform on which the radix tree doesn't work, or on which it is (materially) slower than the old version.

I went ahead and also increased the arena and pool sizes by 4x. That's based on advice from Tim Peters. He says:

I'm all in favor of merging this if it included, say, quadrupling
arena and pool sizes. Then the radix tree is a much better technical
approach than the "big pool" hack I coded.

Tim's big-pool patch is here: https://bugs.python.org/issue37211

I've run pyperformance benchmarks. Without increasing pool and arena sizes, the radix version of obmalloc is 1.01x or 1.02x slower on some of the benchmarks. With the size increases, it is about that much faster. So, I would say basically a wash.

If disabled, the old version of address_in_range() is used.
@methane
Member

methane commented Feb 25, 2021

It's a bit more code to leave the old version in but not too much. I can do that. Should I also add a "configure" command line option? E.g. --disable-obmalloc-radix-tree.

A configure option is not required. If some Python user finds a memory usage regression, they can build Python with make -D PYMALLOC_RADIXTREE=0.

@gvanrossum
Member

@nascheme So what's stopping us from landing this?

@nascheme
Member Author

nascheme commented Mar 3, 2021

Only my own caution, I think. It is easy to back out or disable if we decide it's not good. I have a small cleanup to variable names to make but I can merge after that.

@gvanrossum
Member

Should probably get a review from @pablogsal (Pablo, you can also delegate to someone else).

@eduardo-elizondo
Contributor

Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e. no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change.

@pablogsal
Member

Hey everyone, wanted to add an extra data point here. We tested this change with the Instagram workload and all the health metrics looked fine (i.e. no weird crashes / behavior). Also, performance-wise, this looks like a net-neutral change.

Thanks a lot @eduardo-elizondo, that will help us a lot when making a review and a final decision. Seems that the main thing we need to consider is maintenance cost and possible semantic change, but I am optimistic about both. I still need to find time to review it in detail.

@pablogsal
Member

One question: does valgrind complain the same way with this change as it used to with the old address_in_range()? Is the situation better with the radix tree?

@tim-one
Member

tim-one commented Mar 11, 2021

Pablo, I never ran valgrind myself, but a primary point of this different approach is that it never references uninitialized memory (well, assuming users don't pass insane addresses to free()/realloc()). The current approach does, which is why valgrind legitimately (but spuriously) complains. valgrind "should be" 100% happy with what the PR does instead.

@tim-one tim-one closed this Mar 11, 2021
@tim-one tim-one reopened this Mar 11, 2021
@pablogsal pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 11, 2021
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit b41a77a 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Mar 11, 2021
@nascheme nascheme merged commit 85b6b70 into python:master Mar 30, 2021
@bedevere-bot

⚠️⚠️⚠️ Buildbot failure ⚠️⚠️⚠️

Hi! The buildbot s390x RHEL8 3.x has failed when building commit 85b6b70.

What do you need to do:

  1. Don't panic.
  2. Check the buildbot page in the devguide if you don't know what the buildbots are or how they work.
  3. Go to the page of the buildbot that failed (https://buildbot.python.org/all/#builders/509/builds/926) and take a look at the build logs.
  4. Check if the failure is related to this commit (85b6b70) or if it is a false positive.
  5. If the failure is related to this commit, please, reflect that on the issue and make a new Pull Request with a fix.

You can take a look at the buildbot page here:

https://buildbot.python.org/all/#builders/509/builds/926

Summary of the results of the build (if available):

== Tests result: ENV CHANGED ==

412 tests OK.

10 slowest tests:

  • test_concurrent_futures: 3 min 5 sec
  • test_gdb: 1 min 24 sec
  • test_multiprocessing_spawn: 1 min 24 sec
  • test_peg_generator: 1 min 17 sec
  • test_capi: 1 min 14 sec
  • test_unparse: 1 min 9 sec
  • test_multiprocessing_forkserver: 1 min 7 sec
  • test_multiprocessing_fork: 1 min 2 sec
  • test_tokenize: 56.0 sec
  • test_asyncio: 54.1 sec

1 test altered the execution environment:
test_asyncio

14 tests skipped:
test_devpoll test_ioctl test_kqueue test_msilib test_nis
test_ossaudiodev test_startfile test_tix test_tk test_ttk_guionly
test_winconsoleio test_winreg test_winsound test_zipfile64

Total duration: 5 min 35 sec

Click to see traceback logs
Traceback (most recent call last):
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 321, in __del__
    self.close()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 316, in close
    self._ssl_protocol._start_shutdown()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 590, in _start_shutdown
    self._abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/sslproto.py", line 731, in _abort
    self._transport.abort()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 680, in abort
    self._force_close(None)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/selector_events.py", line 731, in _force_close
    self._loop.call_soon(self._call_connection_lost, exc)
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 745, in call_soon
    self._check_closed()
  File "/home/dje/cpython-buildarea/3.x.edelsohn-rhel8-z/build/Lib/asyncio/base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed

@nascheme
Member Author

Buildbot failure on S390x seems to be unrelated to this change. See https://bugs.python.org/issue37368 .

/* number of bits in a pointer */
#define POINTER_BITS 64

/* Current 64-bit processors are limited to 48-bit physical addresses. For
Contributor


You're talking about virtual addresses here; Python does not see physical addresses unless run bare metal.

And the 48-bit claim isn't true in general; you're only thinking of x86, and even then LA57 support has been floating around for half a decade now. AArch64, depending on hardware support and configuration, can go up to 52-bit addresses. Some 64-bit SPARC machines have full 64-bit virtual address space support (though the 4-level Linux kernel implementation can only use up to 53). Itanium is complicated; it seems to use up to 44 of the low bits, at least on this specific machine, but the top three bits are also significant, as they are a region number within which the other bits are interpreted (i.e. you get 8 separate address spaces you can configure differently).

Contributor


(RISC-V also has only Sv39 and Sv48 currently but Sv57 is on track to be ratified very soon)

Member Author


It sounds like we need to make the comment more accurate and add some extra ifdefs for these machines. The easiest fix is to set PYMALLOC_RADIXTREE=0 for those machines.

Contributor


Or perhaps better to have the throwing-bits-away version opt-in rather than opt-out, so some architecture you've never heard of, or doesn't yet exist, works, albeit perhaps a little less efficiently, rather than breaking?


#define MAP_TOP_BITS INTERIOR_BITS
#define MAP_TOP_LENGTH (1 << MAP_TOP_BITS)
#define MAP_TOP_MASK (MAP_BOT_LENGTH - 1)
Member


MAP_TOP_LENGTH?

Member Author


Yup, it's a bug; it should be TOP, not BOT. On 64-bit machines, MAP_BOT_LENGTH is 256 and MAP_TOP_LENGTH is 1024.

diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Jan 5, 2024
…n_range(). (pythonGH-14474)

This is a backport of the radix tree implementation for pymalloc.
It is not platform specific but it solves a segmentation fault on
aarch64 platforms when MTE (Memory Tag Extension) is enabled.
The github issue is pythongh-87759.

Original commit message:
The radix tree approach is a relatively simple and memory sanitary
alternative to the old (slightly) unsanitary address_in_range().
To disable the radix tree map, set a preprocessor flag as follows:
-DWITH_PYMALLOC_RADIX_TREE=0.

(cherry picked from commit 85b6b70)

Co-authored-by: Tim Peters <tim.peters@gmail.com>
Jira: ENTLLT-7122
Change-Id: Ifbc8c1290077b78c85ac30abe0bcbb7c8ea0b959
diegorusso pushed a commit to diegorusso/cpython that referenced this pull request Jan 5, 2024
…n_range(). (pythonGH-14474)

This is a backport of the radix tree implementation for pymalloc.
It is not platform specific but it solves a segmentation fault on
aarch64 platforms when MTE (Memory Tag Extension) is enabled.
The github issue is pythongh-87759.

Original commit message:
The radix tree approach is a relatively simple and memory sanitary
alternative to the old (slightly) unsanitary address_in_range().
To disable the radix tree map, set a preprocessor flag as follows:
-DWITH_PYMALLOC_RADIX_TREE=0.

(cherry picked from commit 85b6b70)

Co-authored-by: Tim Peters <tim.peters@gmail.com>
Change-Id: I0a3c2979c207f997c707c5f798941426c8d50004
@bedevere-app

bedevere-app bot commented Jan 5, 2024

GH-113737 is a backport of this pull request to the 3.9 branch.

