Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

explicitly distinguish pointer::addr and pointer::expose_addr #95588

Merged
merged 1 commit into from
Apr 5, 2022

Conversation

RalfJung
Copy link
Member

@RalfJung RalfJung commented Apr 2, 2022

@bgeron pointed out that the current docs promise that ptr.addr() and ptr as usize are equivalent. I don't think that is a promise we want to make. (Conceptually, ptr as usize might 'escape' the provenance to enable future usize as ptr casts, but ptr.addr() dertainly does not do that.)

So I propose we word the docs a bit more carefully here. @Gankra what do you think?

@rust-highfive
Copy link
Collaborator

r? @kennytm

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 2, 2022
@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

IMO they should always be equivalent. Under a "real model" if it's deemed necessary for as usize to be messier to let more code compile under it, presumably it should be just as necessary for .addr(). Under strict-provenance, there is no escape hatch and both would lose the info. .addr() "doesn't need to exist" but exists primarily to:

In particular, if you "care" about the subgoal of being able to decouple usize==ptr on CHERI, then the two must be the same, because usize literally isn't big enough to fit provenance info into anymore.

If you don't want to "play along" and keep using usize casts, that's fine, your code is just Less Portable. (edit: and you get less "nice things" like "perfect miri asan")

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

Under a "real model" if it's deemed necessary for as usize to be messier to let more code compile under it, presumably it should be just as necessary for .addr().

I don't think so. In my view, addr() is specifically for code that promises to use with_addr for casting back this address to a pointer. Legacy code does not use addr() so there is no reason to make addr() allow such casts.

Furthermore, by disallowing such casts, some tools (not CHERI, but Miri) can do a better job at enforcing correctness in programs that mix strict provenance APIs with ptr2int2ptr roundtrips via as: if your memory was never subject to a ptr2int as cast, Miri can ensure that it will never be accessed by the result of an int2ptr as cast.

If you don't want to "play along" and keep using usize casts, that's fine, your code is just Less Portable. (edit: and you get less "nice things" like "perfect miri asan")

If most of the code does play along but a single crate does not, then I think we should still say that only pointers explicitly cast to ints via as are ever allowed to be cast back.

Under strict-provenance, there is no escape hatch and both would lose the info.

So, my current thinking is that strict provenance is not a toggle in the memory model, it is a subset of the Rust language. As in, full Rust has permissive provenance (and our stability guarantee covers that), but if you want extras like CHERI or Miri you have to get your program into the strict provenance subset.

Under this view, there is no such thing as 'this code does X under strict provenance and Y under permissive provenance'.

For programs that are entirely inside the strict provenance subset, indeed this PR does not change anything. But it matters for programs that are gradually moving into that subset, and still have some ptr2int2ptr roundtrips left in some places. For those, I think we should have the rules as set in this PR.

@digama0
Copy link
Contributor

digama0 commented Apr 2, 2022

The way the zulip discussion has been going, I think there seems to be some consensus that we want weak provenance for the actual rust spec, but with strict provenance as a simple to understand subset that we should aspire to live within, and for which Miri et al can work effectively. In that setting, there are definitely two ptr2int operations here: one "broadcasts" / "escapes" the pointer such that it can be used by a subsequent int2ptr, while the other one just extracts the address bits without escaping the pointer, which means that even if you cast it back to a pointer you just get an invalid pointer.

For strict-provenance conforming code, you want to maximally use the non-escaping version, because you have no intention of ever doing an int2ptr so it's just performance / optimization left on the table to do otherwise. Hence I think ptr.addr() should be non-escaping. On the other hand ptr as usize needs to be usable for int2ptr for backward compatibility, so it needs to be escaping. Arguably (by which I mean "argued already"), we should have another function ptr.escape_addr() to be a synonym for ptr as usize which makes the side effecting behavior more explicit and opens the door for eventual deprecation of ptr as usize altogether.

But in far future rust, I think we will want both, because strict provenance everywhere is not possible in the face of FFI and C code that doesn't observe these rules, even if we ignore all the legacy rust code that will not be updated due to MSRV or laziness.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

See this comment for a more detailed explanation of how I am currently interpreting strict provenance.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

But in far future rust, I think we will want both, because strict provenance everywhere is not possible in the face of FFI and C code that doesn't observe these rules, even if we ignore all the legacy rust code that will not be updated due to MSRV or laziness.

I think FFI is a red herring and strict provenance everywhere is possible. But we have too much legacy code to mandate that and I'd rather not have two memory models (strict and permissive).

But that discussion is orthogonal to this PR so let's continue on Zulip. ;)

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

I should also add that my generic argument for why supporting permissive provenance in the optimizer is easy -- as in, the argument for why we can reasonably guarantee that code with ptr2int2ptr cast roundtrips will keep compiling correctly -- relies on the requirement that is added by this PR. :)

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

I am pretty uncomfortable with any suggestion in this thread that .addr() should/could have any impact on codegen, and that opting into it would in some sense be "more dangerous". That is wholy incongruent with the design / goals / philosophy of this project.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

Interesting. To me this is required to achieve the full potential of strict provenance. I am not even sure what the benefits of this project would be if we still allow code like I think we definitely should reject code like

let addr = ptr.addr();
let ptr = addr as *const u8;
let _val = *ptr;

EDIT: Hm okay I guess it can still help Miri, but it won't help the optimizer. So see the better arguments below.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

I also think distinguishing addr from escape_addr/broadcast_addr helps a lot in explaining the model of full Rust. It certainly helped a lot in the discussions the last few days.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

As one last point, I think we will have to pessimize optimizations around ptr as usize because there is a global side-effect here that applies even if the result of this cast is unused. It would be a shame if code that used map_addr for pointer tagging would have to pay that cost.

The existence of permissive provenance code in some parts of the program should not come at the cost of optimizations for other parts of the program. I don't think we want a 'global toggle' on a language fork for strict vs permissive provenance. But to achieve this isolation within a single compiled program, we have to distinguish ptr as usize from ptr.addr().

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

@RalfJung you're "allowed" to as far as the compiler and abstract machine are concerned, but if you want better validation (and targetting CHERI) you need to "do better", and we already have tools that can help you do that. I think it's wholy desirable for older projects to be "pressured" to conform by people running miri and getting an explosion and being annoyed.

If there is clarification to be had in the docs it's that what is currently written is "the strict provenance interpretation of both this operation and as usize" and that if a real model has weaker semantics for as usize then it must necessarily have the same weaker semantics for .addr().

(I must confess I am pessimistic on the ability to define a "true" memory model that everyone would be happy with. Part of my goal is to make more code "boring" to the memory model so that the semantic uncanny valley between "what code thinks can happen" and "what the model actually says" is rarer and conceptually like, double-unsafe-code that people scrutinize and validate with more aggressive means, in the same vein of C code that is "probably UB but works in practice because why would the compiler actually miscompile this".)

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

if you want better validation (and targetting CHERI) you need to "do better", and we already have tools that can help you do that. I think it's wholy desirable for older projects to be "pressured" to conform by people running miri and getting an explosion and being annoyed.

All of this is still true under my proposed change.

Why should we allow programs like that?

(I must confess I am pessimistic on the ability to define a "true" memory model that everyone would be happy with. Part of my goal is to make more code "boring" to the memory model so that the semantic uncanny valley between "what code thinks can happen" and "what the model actually says" is rarer and conceptually like, double-unsafe-code that people scrutinize and validate with more aggressive means, in the same vein of C code that is "probably UB but works in practice because why would the compiler actually miscompile this".)

I have a very concrete vision for defining a memory model that at least a lot of people would be happy with. It would ban ptr2int2ptr transmute roundtrips, and maybe ptr2int transmutes in general, but preserve everything else.

But to make that model coherent, I argue we need the change I am proposing here.

I agree with having more "boring code", and the APIs help make "more code boring". This change does not even affect "boring code", so if you basically gave up already on code outside the 'boring' fragment you should be agnostic on this change.

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

Maybe I'm off-base here but as far as the optimizer goes, Rust code is already Pretty Dang Fast, and is "doomed" to lower to compiler backends that implement something like PNVI-ae-udi anyway, so the promise of Better Codegen is not something I regard as plausible or interesting.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

I am saying we need to regress codegen for code that uses ptr as usize to achieve correctness. I think LLVM will have to do that if they ever want to actually conform to PNVI-ae-udi + restrict. Certainly I will make sure MIR optimizations conform to that (or the Rust equivalent, with similar effects).

So, what I said about optimizations is not about making anything faster than it is right now. It is about keeping the optimizations that we currently do, and putting them on solid footing. Under the vision I mentioned above, optimizing pointer tagging like we do today is correct if and only if we accept this PR. If we don't accept it, that is almost the equivalent of adding an extern function call around each ptr as usize, usize as ptr -- and each addr() since you insist it is equivalent to ptr as usize.

I am not promising Better Codegen, I am promising Actually Explaining Why Current Codegen Makes Sense (for code that uses strict provenance APIs), and Worse But Correct Codegen (for code that doesn't).

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

A possible compromise: make it clear that this is making a materially different claim from as usize BUT explicitly note that this is an experimental semantic and that we are wholy willing to shut it off and make the two have identical semantics if this ends up being more trouble than it's worth (because we can always remove UB, but can't add it).

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

I would also IMO prefer if there was a new method that did have your claimed semantics so that older code could make it clear that it's doing Pointer Crimes and not just completely unaware of the distinction. (and then I guess under CHERI as usize and that other method would maybe be regarded as compile-errors?)

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

explicitly note that this is an experimental semantic and that we are wholy willing to shut it off and make the two have identical semantics if this ends up being more trouble than it's worth (because we can always remove UB, but can't add it).

I thought that was basically implicit already in the fact that this is (a) unstable and (b) references the strict provenance experiment.

I added an even more explicit note, does that work for you?

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

I would also IMO prefer if there was a new method that did have your claimed semantics so that older code could make it clear that it's doing Pointer Crimes and not just completely unaware of the distinction. (and then I guess under CHERI as usize and that other method would maybe be regarded as compile-errors?)

Yes I think there is general consensus that such a method should exist, together with an explicit method to cast an int back to a pointer. Then we can deprecate ptr2int and int2ptr as casts and move everyone to those methods.

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

So part of the reason I defined everything to have Same As Today semantics is so that:

  • These methods could be on the fast track for stabilization (with the actual strict-provenance parts still unstable).
  • The stable polyfill would work just as well as the "builtin" version, while that version is still unstable.

Like, regardless of models or whatever these are just Nice To Have. .addr() is literally one less character than as usize and composes better for everything other than ref as *mut _ as usize.

@Gankra
Copy link
Contributor

Gankra commented Apr 2, 2022

This has exhausted my remaining spoons to discuss this so I won't be able to review this (and am supposed to be "done").

I've given all the insights I can here. I would prefer the two ops be the same, but if you must then tread carefully because you may scupper the social-engineering of this project. Giving people free toys that are nicer and happen to work better is how you move needles. Making this "more dangerous" makes the toys less free. I also think trying to conceptualize a "mixed mode" miri is dropping the ball on being able to use miri as pressure for code to be become more "boring".

@digama0
Copy link
Contributor

digama0 commented Apr 2, 2022

From the social engineering side, I don't think it's a particularly hard sell.

You are a Good Rustacean, you want to write Good Code. The rust team is now bringing you a new function that both expresses intent better (this is getting the address of a pointer only, not committing Crimes) and also allows the compiler to make your code faster because you expressed your intent. Win win. The looming spectre of UB will only haunt Bad Rustaceans that lie to the compiler and commit Pointer Crimes using this function, and that's not you, right?

@eddyb
Copy link
Member

eddyb commented Apr 2, 2022

I don't have a strong opinion but I feel like we could avoid making promises now without removing the possibility of making them later.

That is, the docs should probably imply neither of .addr() == (as usize) and .addr() != (as usize).

Ideally we'd use maximally-cautionary language that suggests you should pair .addr() with .with_addr, without implying that failing to do so will lead to invalid pointers or any other consequences either way.

Or framed a bit differently, maximally portable "strict provenance" code avoids int2ptr/ptr2int entirely, so the distinction doesn't matter much.
For anything more hybrid we can punt on the degree of interop of .addr() with int2ptr when that model is defined.

Not sure how to describe this in docs exactly though.

Also, @RalfJung, for miri, could .addr() be set up to trigger a warning on int2ptr from it?
Keeping its behavior aligned with as usize but offering additional information to users that they may have meant to pair it with .with_addr.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

So part of the reason I defined everything to have Same As Today semantics is so that:

Since this proposal just adds some UB, those 2 points remain true after my PR. They would for now be library UB, not language UB, but that doesn't seem like a big deal to me for a transition period.

But I think it would be a shame not to use this to also start moving towards a cleaner model while maintaining our current optimization potential (or, rather, to actually unlock the optimization potential that we are currently only having under false premises and "hope it doesn't go wrong"). I thought that was also part of the motivation. Even if it's not at the top of your list, it certainly is a big part of why I got excited about it.

This has exhausted my remaining spoons to discuss this so I won't be able to review this (and am supposed to be "done").

Sorry to hear that. :(

I've given all the insights I can here. I would prefer the two ops be the same, but if you must then tread carefully because you may scupper the social-engineering of this project. Giving people free toys that are nicer and happen to work better is how you move needles. Making this "more dangerous" makes the toys less free. I also think trying to conceptualize a "mixed mode" miri is dropping the ball on being able to use miri as pressure for code to be become more "boring".

I appreciate that. I think there is also some carrots in the other direction though, which is to say that we can promise people that if their code matches this API then we can actually keep optimizing it like we do currently without speculative advances in formal model technology. Basically, we can have our cake (no odd optimization barriers in pointer tagging code) and eat it, too (still supporting code that does ptr2int2ptr cast roundtrips, in a foundationally-formal-correct kind of way), but only if we let the compiler know when some code actually claims to follow strict provenance.

So, my thinking is that this will help make the toy more attractive, overall. But I might be wrong. I wonder what others think.

@digama0

You are a Good Rustacean, you want to write Good Code. The rust team is now bringing you a new function that both expresses intent better (this is getting the address of a pointer only, not committing Crimes) and also allows the compiler to make your code faster because you expressed your intent. Win win. The looming spectre of UB will only haunt Bad Rustaceans that lie to the compiler and commit Pointer Crimes using this function, and that's not you, right?

😂
However I think in the last sentence it should be the "looming spectre of UB or lost optimization potential". Which is probably even worse for some people. ;)
EDIT: Hm, you said "using this function", so I guess it's actually UB.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 2, 2022

@eddyb isn't the forward-compatible option to make it UB? If people end up writing code like the example above, then for the purpose of optimizations and formalizing Rust behavior we will end up in a situation almost like today where we have legacy code we have to support that doesn't give the compiler the information we require for formally correct high-quality codegen.

for miri, could .addr() be set up to trigger a warning on int2ptr from it?

In principle, yes -- but we'd first have to actually implement the concept of 'exposed allocations' in Miri, or even 'exposed provenance' (considering Stacked Borrows). So, not any time soon, I think.

@scottmcm
Copy link
Member

scottmcm commented Apr 2, 2022

I think we need a .bikeshed_addr() if we ever want to make .addr() different from a cast.

Because right now, people are just going to see "oh, there's a method for this" and start using it, probably without checking whether doing so is actually sound. If we can say in the lint (#95488) "hey, make sure you pick carefully between .addr() and .bikeshed_addr()", then at least we have a chance of people getting it right, and have more ammo to actually break code if one or the other actually does become different from as usize. (cc #95583 about pointing people to .addr() if they use the currently-in-nightly .to_bits().)

@arichardson
Copy link

Another way to steer people away from these low-level APIs would be to provide higher level functions such as is_aligned(), align_up(), align_down(), etc. to completely hide the ptr2addr conversion from callers.

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 4, 2022
explicitly distinguish pointer::addr and pointer::expose_addr

`@bgeron` pointed out that the current docs promise that `ptr.addr()` and `ptr as usize` are equivalent. I don't think that is a promise we want to make. (Conceptually, `ptr as usize` might 'escape' the provenance to enable future `usize as ptr` casts, but `ptr.addr()` dertainly does not do that.)

So I propose we word the docs a bit more carefully here. `@Gankra` what do you think?
@RalfJung
Copy link
Member Author

RalfJung commented Apr 4, 2022

Strictly speaking, it is only from_exposed_addr that Miri can't support. I would be inclined to only warn / error on this function and say nothing on expose_addr (treat it the same as addr). You can still conform to strict provenance and use expose_addr: maybe you are being defensive against client code that may or may not be following strict provenance.

Hm, I would have thought people might want warnings against all non-strict-provenance operations, but you are right that there is no loss in analysis precision until from_exposed_addr is called so at least by default we probably shouldn't warn any earlier than that.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 4, 2022

@bors r-

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. labels Apr 4, 2022
@RalfJung
Copy link
Member Author

RalfJung commented Apr 4, 2022

@bors r=scottmcm

@bors
Copy link
Contributor

bors commented Apr 4, 2022

📌 Commit 7138a322ef76a658ab240c97264c33937cfb815c has been approved by scottmcm

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Apr 4, 2022
@rust-log-analyzer

This comment has been minimized.

@RalfJung
Copy link
Member Author

RalfJung commented Apr 4, 2022

@bors r=scottmcm

@bors
Copy link
Contributor

bors commented Apr 4, 2022

📌 Commit 0252fc9 has been approved by scottmcm

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this pull request Apr 4, 2022
explicitly distinguish pointer::addr and pointer::expose_addr

`@bgeron` pointed out that the current docs promise that `ptr.addr()` and `ptr as usize` are equivalent. I don't think that is a promise we want to make. (Conceptually, `ptr as usize` might 'escape' the provenance to enable future `usize as ptr` casts, but `ptr.addr()` dertainly does not do that.)

So I propose we word the docs a bit more carefully here. `@Gankra` what do you think?
bors added a commit to rust-lang-ci/rust that referenced this pull request Apr 5, 2022
Rollup of 7 pull requests

Successful merges:

 - rust-lang#91873 (Mention implementers of unsatisfied trait)
 - rust-lang#95588 (explicitly distinguish pointer::addr and pointer::expose_addr)
 - rust-lang#95603 (Fix late-bound ICE in `dyn` return type suggestion)
 - rust-lang#95620 (interpret: remove MemoryExtra in favor of giving access to the Machine)
 - rust-lang#95630 (Update `NonNull` pointer provenance methods' documentation)
 - rust-lang#95631 (Refactor: remove unnecessary nested blocks)
 - rust-lang#95642 (`CandidateSource::XCandidate` -> `CandidateSource::X`)

Failed merges:

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 3bf33b9 into rust-lang:master Apr 5, 2022
@rustbot rustbot added this to the 1.62.0 milestone Apr 5, 2022
@RalfJung RalfJung deleted the strict-provenance branch April 5, 2022 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-strict-provenance Area: Strict provenance for raw pointers S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.