Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tagged integers #676

Open
markshannon opened this issue May 8, 2024 · 6 comments
Open

Tagged integers #676

markshannon opened this issue May 8, 2024 · 6 comments

Comments

@markshannon
Copy link
Member

markshannon commented May 8, 2024

With deferred reference counting, we gain some freedom in how we represent references to objects.
One common optimization in VMs, from early Smalltalk days to V8 and Javascript Core, is tagged pointers.
The idea is to use one or more of the spare bits in a pointer to distinguish pointers from other values.

Given how ubiquitous integers are, they are the obvious candidate for tagging.
Tagging floats is also appealing, but is probably too awkward to be viable, but here's an idea on how it might be done

Given that we plan to convert stack references from PyObject * to an opaque struct anyway, adding tagged ints becomes considerably easier.

Stack and heap references

Converting from a reference struct to a PyObject * is potentially expensive if the struct contains a tagged int, as we need to allocate and free boxed ints. To reduce the cost we want to minimize the conversions. That means that most collection objects should contain reference structs, not PyObject *s. We don't need to change them all at once though.

Ideally, the only conversions would occur at the boundary between the VM and third party code. Although even that cost can be be largely eliminated

The tagging scheme

The obvious scheme is to set the low bit to 1 for unboxed ints, storing the actual value in the remaining bits.
It actually makes more sense, however to tag pointers and not ints for a few reasons.

  1. It means that we don't need to offset the int by 1 when performing arithmetic
  2. The cost of the offset for pointers is effectively zero, as almost all pointer dereferences include a constant offset. Changing the constant offset by one is free
  3. It help finds bugs. Casting a tagged pointer to an untagged pointer is incorrect, and by tagging pointers will almost certainly cause a crash

If we want to reserve a second bit for tagging other values, or maybe marking deferred references, then we have the following tagging scheme:

Tag Meaning
00 Unboxed int
01 Reserved
10 Reserved
11 Tagged pointer (value = ptr-1)
@gvanrossum
Copy link
Collaborator

See also Ken Jin's thoughts on tagging, in a Google Doc and a PR (python/cpython#118450).

@Fidget-Spinner
Copy link
Collaborator

The tagging scheme used in that PR is compatible with this. It needs very few modifications to tag everything to 0b11 like what this issue is suggesting.

@gvanrossum
Copy link
Collaborator

Tagged pointer (value = ptr-1)

Shouldn't that be "(value = ptr-3)"? Or am I missing something?

@markshannon
Copy link
Member Author

Tagged pointer (value = ptr-1)

Shouldn't that be "(value = ptr-3)"? Or am I missing something?

It is ptr-1: xxxx00 - 1 == yyyy11, where yyyy == xxxx - 1

@aivarsk
Copy link

aivarsk commented Sep 9, 2024

I found this https://coredumped.dev/2024/09/09/what-is-the-best-pointer-tagging-method/ article today, the lower byte tagging is interesting.

@markshannon
Copy link
Member Author

markshannon commented Dec 6, 2024

We are now using tagged references on the stack for the free-threaded build and I expect to be using them in normal builds soon.

For stack references we have:

Tag Meaning
00 Normal pointers
01 Pointers with embedded reference count
10 Unused
11 Pointer to immortal object (including NULL)

Heap references cannot have embedded reference count (we wouldn't be able to find them), so for heap references we would have:

Tag Meaning
00 Normal pointers
01 Unused, but not usable
10 Unused
11 Pointer to immortal object (including NULL)

Tagged integers behave like immortal objects. They do not have lifetimes and do not have a changeable reference count.
Which means to avoid having to redesign the whole tagging scheme and loose its desirable properties, we need the low two bits for tagged ints to be 11 as well, so we need another tag bit:

Tag Meaning
000 Normal pointers
001 Stack only: embedded reference count
010 Unused
011 Pointer to immortal object (including NULL)
100 Unused
101 Unused, not usable
110 Unused
111 Tagged integer

This is fine for 64 bit machines.
It should be OK for 32 bit machines because all allocations should be two-word aligned, but I'm not 100% sure it would work in practice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants