Some questions around what is UB and what is defined when using raw pointers #205

Closed
elichai opened this issue Sep 18, 2019 · 35 comments
Labels
C-support Category: Supporting a user to solve a concrete problem

Comments

@elichai

elichai commented Sep 18, 2019

Hi,
I've read quite a lot of the information out there and I still have a bunch of open questions about pointers. Answering them will help me, but I also think the answers would be good to add to the docs for other people.

This list isn't exhaustive and I'll try to add stuff I see and don't know the answer to:

  1. Passing a pointer to a const variable. -> Is fine but may result in different pointers to the same variable (see #205 (comment))
  2. Mutating a const variable through raw pointer casting. -> Not allowed. (see #205 (comment))
  3. Casting a pointer to usize and back.
  4. Checking alignment via modulo over ptr as usize. (related to 3) (less relevant now that we have align_offset but people still do it in the wild).
@elichai
Author

elichai commented Sep 18, 2019

I think 2 can fall as UB under:

In general, transmuting an &T type into an &mut T is considered undefined behavior.

This is a quote from the docs of UnsafeCell, and in my opinion it should also be mentioned here.

@bjorn3
Member

bjorn3 commented Sep 18, 2019

Mutating a const variable through raw pointer casting.

That's simply impossible. Consts don't have any address: when you take &MY_CONST, a new promoted is created, and that promoted will be in read-only memory.

@elichai
Author

elichai commented Sep 18, 2019

@bjorn3 you're talking from the mindset of rustc's low-level internals.
In practice this can happen on purpose or by accident, and it currently causes a segmentation fault: https://play.rust-lang.org/?gist=99b3e1414761fb27a176d960ac56baa0 (though I get that it probably doesn't try to modify the const and does something else instead).

@gnzlbg
Contributor

gnzlbg commented Sep 18, 2019

@elichai you did not understand what @bjorn3 mentioned: that code does not create a pointer to a const (EDIT: re-read @bjorn3's explanation, they explained it properly).

@elichai
Author

elichai commented Sep 18, 2019

I re-read it again, and what you're saying is that every call to &CONST creates a copy of that const in read only memory?
Meaning that giving a raw pointer to that const is fine (and (&MY_CONST as *const _) != (&MY_CONST as *const _)) but mutating it will probably crash the program (because it's read only memory).
Did I understand correctly now?

@bjorn3
Member

bjorn3 commented Sep 18, 2019

and what you're saying is that every call to &CONST creates a copy of that const in read only memory?

Conceptually, yes, but within one object file it is deduplicated, but copies can exist in different crates.

meaning that giving a raw pointer to that const is fine

Yes

and (&MY_CONST as *const _) != (&MY_CONST as *const _)

As said above, within one object file it is likely equal, but between different crates it isn't.

but mutating it will probably crash the program(because it's read only memory).

Yes

@bjorn3
Member

bjorn3 commented Sep 18, 2019

Casting a pointer to usize and back.

I believe it is always allowed, but it impedes some optimizations.

Checking alignment via modulo over ptr as usize. (related to 3) (less relevant now that we have align_offset but people still do it in the wild).

Allowed, but Miri used to not support ptr<->int casts, so align_offset was invented.

@gnzlbg
Contributor

gnzlbg commented Sep 18, 2019

and what you're saying is that every call to &CONST creates a copy of that const in read only memory?

Conceptually, yes, but within one object file it is deduplicated, but copies can exist in different crates.

That's more an optimization that is sound for programs that don't exhibit UB than the semantics of the language itself.

In the abstract machine, using a CONST creates a new temporary allocation and makes a bitwise copy of the CONST to this allocation. That's all there is to it.

When you write &CONST you are creating a pointer to that allocation, and that pointer is the only way to access the allocation. If the type of the CONST is Freeze there is no way for a Rust program to modify the contents of the allocation from that pointer or pointers derived from it. So we know that the allocation is read-only, and therefore it is correct to put them in the rodata segment of the binary. This is the optimization @bjorn3 mentioned.

However, if you create a &mut T to the temporary, then writing to it is perfectly fine. If you create a &T and T: !Freeze, then writing to it is fine as well. These are just normal Rust semantics at work; there is nothing specific to consts here. These same rules apply to any allocation in the program.
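As a minimal sketch of the temporary-allocation model described above (the const and the function name are made up for illustration), a write through `&mut MY_CONST` lands in a fresh copy and leaves the const item itself untouched:

```rust
const MY_CONST: i32 = 5;

/// Takes `&mut MY_CONST`, writes through it, and returns the value written.
/// The write goes to a fresh temporary that holds a copy of the const.
#[allow(const_item_mutation)] // rustc warns about exactly this pattern
fn write_to_const_temporary() -> i32 {
    let r = &mut MY_CONST; // new temporary, initialized with a copy of 5
    *r += 1;
    *r
}

fn main() {
    assert_eq!(write_to_const_temporary(), 6);
    // The const item is unaffected; every use starts from 5 again.
    assert_eq!(MY_CONST, 5);
    assert_eq!(write_to_const_temporary(), 6);
}
```

The `allow` attribute is only there to silence the compiler warning about mutating a const item; the code compiles without it.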

and (&MY_CONST as *const _) != (&MY_CONST as *const _)

Since the allocation is a local private temporary, that's like writing this:

let x = 3;
let y = 3;
assert!(&x as *const _ as usize != &y as *const _ as usize);

We do not provide any guarantees about what the addresses of x and y are. You can compare their addresses - that's not UB - but there are no guarantees about what the outcome of that assert is.
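A hedged sketch of that distinction, with illustrative function names: comparing the addresses of two locals is an ordinary safe operation whose result should not be relied on, whereas casting the same pointer twice necessarily gives the same address.

```rust
/// Compares the addresses of two distinct locals. This is not UB, but per
/// the discussion above there is no promise about which result comes back.
fn locals_have_different_addresses() -> bool {
    let x = 3;
    let y = 3;
    (&x as *const i32 as usize) != (&y as *const i32 as usize)
}

/// Casting the same pointer twice yields the same address.
fn same_pointer_same_address() -> bool {
    let x = 3;
    let p = &x as *const i32;
    (p as usize) == (p as usize)
}

fn main() {
    // Print rather than assert: the outcome is not something to rely on.
    println!("distinct locals differ: {}", locals_have_different_addresses());
    assert!(same_pointer_same_address());
}
```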


Casting a pointer to usize and back.

This is not UB.


Checking alignment via modulo over ptr as usize.

This is not UB either. At most it is a logic bug, and sometimes it even works correctly.

@elichai
Author

elichai commented Sep 18, 2019

@gnzlbg Thank you. I finally think I understand consts in rust properly :)
For a sec it didn't make sense to me how &mut MY_CONST is even allowed but I get it now.
consts get copied and become a temporary on every invocation.
Thanks for clearing that up :).

This is not UB.

I assumed that, but it's good to know that it's defined.

This is not UB either. At most it is a logic bug, and sometimes it even works correctly.

So that's not a proper way to check alignment? (Looking at the implementation of align_offset, I assumed it's not really the right way to do this, heh.)

@gnzlbg
Contributor

gnzlbg commented Sep 18, 2019

So that's not a proper way to check alignment? (Looking at the implementation of align_offset, I assumed it's not really the right way to do this, heh.)

That's how I understand it. It is not a good way to check alignment, but it is not UB per se. If that returns some incorrect result, your program might or might not do something later that is UB.

@RalfJung RalfJung added the C-open-question Category: An open question that we should revisit label Sep 18, 2019
@RalfJung
Member

I think 4 can fall as UB under:

In general, transmuting an &T type into an &mut T is considered undefined behavior.

4 is about alignment. What am I missing?


Mutating a const variable through raw pointer casting. -> Not allowed.

The conclusion was actually more like "not syntactically possible".
It's like asking "what is the behavior of adding a unicorn to 3?". It's not that Rust disallows it; it's that this is not even a well-formed question in the context of Rust.

Mutating read-only memory through raw ptr casting is UB, and the reference says so:

Mutating immutable data. All data inside a const item is immutable. Moreover, all data reached through a shared reference or data owned by an immutable binding is immutable, unless that data is contained within an UnsafeCell.
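The UnsafeCell exception in the quoted rule can be sketched with `std::cell::Cell`, which is built on `UnsafeCell`. The `bump` helper is a name chosen for illustration; the point is only that mutation through a shared reference is fine here and needs no `unsafe`:

```rust
use std::cell::Cell;

/// Mutates through a shared reference. This is allowed because `Cell`
/// wraps its contents in `UnsafeCell`.
fn bump(counter: &Cell<i32>) {
    counter.set(counter.get() + 1);
}

fn main() {
    let counter = Cell::new(0);
    bump(&counter);
    bump(&counter);
    assert_eq!(counter.get(), 2);
}
```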


Casting a pointer to usize and back.

That's not a question. But know that int-ptr-casts are very poorly understood in C and LLVM, and Rust inherits this. But given that in Rust you can only cast usize with raw pointers, nothing can go wrong. These are even safe operations so I wonder why you might even think they are UB?


Checking alignment via modulo over ptr as usize. (related to 3)

Indeed that's the only way to actually check alignment. align_offset is not for checking alignment, it is for aligning things.
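A minimal sketch of such a modulo check follows. The helper name `is_aligned_to` is made up for this example, and the second assertion assumes `align_of::<u64>()` is at least 2, which holds on mainstream targets:

```rust
use std::mem::align_of;

/// Returns true if the address represented by `ptr` is a multiple of `align`.
/// `align` must be non-zero (alignments are powers of two in practice).
fn is_aligned_to<T>(ptr: *const T, align: usize) -> bool {
    (ptr as usize) % align == 0
}

fn main() {
    let value: u64 = 0;
    let ptr = &value as *const u64;
    // A reference is always aligned for its type.
    assert!(is_aligned_to(ptr, align_of::<u64>()));

    // Assuming u64 is at least 2-aligned, one byte further is an odd address.
    let odd = (ptr as *const u8).wrapping_add(1);
    assert!(!is_aligned_to(odd, 2));
}
```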

@gnzlbg
Contributor

gnzlbg commented Sep 19, 2019

All data inside a const item is immutable.

That makes it sound like const items contain data, but I don't think this is the case, e.g., 3 does not contain any data.

@elichai
Author

elichai commented Sep 19, 2019

4 is about alignment. What am I missing?

Sorry fixed. I meant 2.

The conclusion was actually more like "not syntactically possible".

I still think it's worth explaining. To people who aren't familiar with how Rust handles consts (like me before reading this thread :) ), it sounds confusing to be told that something "isn't possible" when they can easily write &A as *const _ as *mut _, so I think it's worth explaining why it's not syntactically possible.

That's not a question. But know that int-ptr-casts are very poorly understood in C and LLVM, and Rust inherits this. But given that in Rust you can only cast usize with raw pointers, nothing can go wrong. These are even safe operations so I wonder why you might even think they are UB?

If I wasn't clear enough: I meant, will it be UB when dereferencing? Can an int-ptr cast change the pointer in some edge case, or is it completely fine to do this and then dereference (assuming the original ptr was fine to dereference)?

Indeed that's the only way to actually check alignment. align_offset is not for checking alignment, it is for aligning things.

What's wrong with assert!(ptr.align_offset(8) == 0)? Are you saying it's better to do the modulo?

@comex

comex commented Sep 20, 2019

let x = 3;
let y = 3;
assert!(&x as *const _ as usize != &y as *const _ as usize);

We do not provide any guarantees about what the addresses of x and y are. You can compare their addresses - that's not UB - but there are no guarantees about what the outcome of that assert is.

Interesting. That would be weaker semantics than C, which does guarantee that different variables have distinct addresses (including locals), as long as those variables are all in scope. Has this been discussed before?

@gnzlbg
Contributor

gnzlbg commented Sep 20, 2019

I don't know but I think it might be worth it to open an issue to discuss that (it might be something worth guaranteeing, but if so, we should write that down somewhere).

@comex

comex commented Sep 20, 2019

Okay, just filed #206.

@elichai
Author

elichai commented Sep 23, 2019

@gnzlbg @RalfJung Just as an example of the usize/align casting I saw right now: https://github.com/BurntSushi/rust-memchr/blob/master/src/x86/avx.rs#L41

@gnzlbg
Contributor

gnzlbg commented Sep 23, 2019

That LGTM :/

@elichai
Author

elichai commented Sep 23, 2019

@gnzlbg I didn't say it's not. I just couldn't find any official reference docs defining this cast (or stating that it's undefined).
So I wanted to show an example of people actually using it in the wild, because I think it's something worth defining officially. (Isn't that the point of these guidelines?)

@gnzlbg
Contributor

gnzlbg commented Sep 23, 2019

What would you like to have documented?

Pointer to integer casts, integer to pointer casts, and integer arithmetic, are all safe. AFAICT the only unsafe operation there is dereferencing a raw pointer, and that's already documented in the reference and in the nomicon.

@elichai
Author

elichai commented Sep 23, 2019

@gnzlbg
I'll give you an example of what I'm talking about.
We can both agree that this will result in UB:
unsafe {*(&5 as *const i32 as i8 as *const i32)};
Even though this is unsafe, we'll both agree this is defined and fine:
unsafe {*(&5 as *const i32)};
My question is: is this defined and fine? (If so, having it documented would be helpful.)
unsafe {*(&5 as *const i32 as usize as *const i32)};
And is casting to usize and taking the modulo enough as a way to check alignment before dereferencing, or not? (From what @RalfJung said, it might be enough but isn't promised.)

@gnzlbg
Contributor

gnzlbg commented Sep 23, 2019

Your examples have only one unsafe operation: a pointer dereference. In the reference and the nomicon we document that, when you dereference a pointer, that pointer must not be null, it must be aligned for its type, it must be dereferenceable for the size of the type, and it must point to memory containing a valid value of the type.

When you write the unsafe { } block to perform it, you are claiming that all these conditions hold; if they don't, the behavior is undefined.

So is unsafe { *(42_usize as *const i32) } UB? Maybe, or maybe not, depends on whether those conditions hold.
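To make the split between checkable and uncheckable conditions concrete, here is a hedged sketch. The function `read_i32_checked` is hypothetical; it tests the two conditions that can be tested at run time and leaves the rest as a safety contract for the caller. The misaligned case in `main` assumes `align_of::<i32>()` is greater than 1.

```rust
use std::mem::align_of;

/// Reads an `i32` through `ptr` after checking the two preconditions that
/// can be tested at run time (non-null, aligned).
///
/// # Safety
/// If `ptr` is non-null and aligned, it must also be dereferenceable for
/// `size_of::<i32>()` bytes and point to an initialized `i32`. Nothing in
/// this function can check that part.
unsafe fn read_i32_checked(ptr: *const i32) -> Option<i32> {
    if ptr.is_null() || (ptr as usize) % align_of::<i32>() != 0 {
        return None;
    }
    // SAFETY: non-null and aligned were checked above; the remaining
    // conditions are the caller's claim.
    Some(unsafe { *ptr })
}

fn main() {
    let value = 5;
    let good = &value as *const i32;
    // Assumes align_of::<i32>() > 1, so one byte further is misaligned.
    let misaligned = (good as *const u8).wrapping_add(1) as *const i32;
    unsafe {
        assert_eq!(read_i32_checked(good), Some(5));
        assert_eq!(read_i32_checked(std::ptr::null()), None);
        assert_eq!(read_i32_checked(misaligned), None);
    }
}
```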

@elichai
Author

elichai commented Sep 23, 2019

@gnzlbg OK, so I'll rephrase my questions accordingly :)

  1. Does casting a pointer to usize and back promise that all the conditions on the original pointer are still valid on the resulting pointer?

@gnzlbg
Contributor

gnzlbg commented Sep 23, 2019

Does casting a pointer to usize and back promise that all the conditions on the original pointer are still valid on the resulting pointer?

See https://github.com/rust-lang-nursery/reference/blob/master/src/expressions/operator-expr.md#type-cast-expressions - Casting a pointer to a usize gives you the address of the pointer. Casting a usize to a pointer gives you a pointer with that address. If the pointer was non-null and aligned, casting it to a usize and back does not change these properties. If another thread frees the memory behind the pointer while you are doing the casts, then the pointer won't be dereferenceable after the casts.
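A minimal sketch of the round trip described above, assuming the pointee stays alive for the whole operation (the function name is illustrative, not from the reference):

```rust
/// Casts a pointer to `usize` and back, then reads through the result.
fn read_via_round_trip(value: &i32) -> i32 {
    let ptr = value as *const i32;
    let addr = ptr as usize; // the address represented by `ptr`
    let back = addr as *const i32; // a pointer representing that same address

    // Non-null and alignment are properties of the address, so they carry over.
    assert!(!back.is_null());
    assert_eq!(back as usize, addr);

    // SAFETY: `value` is a live reference, so the memory is still allocated
    // and holds an initialized i32 for the duration of this call.
    unsafe { *back }
}

fn main() {
    let x = 5;
    assert_eq!(read_via_round_trip(&x), 5);
}
```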

@RalfJung
Member

RalfJung commented Oct 10, 2019

@elichai

What's wrong with assert!(ptr.align_offset(8) == 0)? Are you saying it's better to do the modulo?

Yes, modulo is better. Read the docs for align_offset to learn why (hint: Ctrl-F "permissible").
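A hedged sketch of the difference, with made-up helper names. The align_offset documentation referred to above allows the method to return `usize::MAX` when it cannot compute an offset, so a check built on it may report "not aligned" for a pointer that is in fact aligned. Only one direction can be relied on: a result of 0 means the pointer is aligned.

```rust
/// Checks 8-byte alignment by address arithmetic.
fn aligned_by_modulo(ptr: *const u8) -> bool {
    (ptr as usize) % 8 == 0
}

/// Checks 8-byte alignment via `align_offset`. A `true` result is reliable;
/// a `false` result may be a false negative if the implementation gave up
/// and returned `usize::MAX`.
fn aligned_by_align_offset(ptr: *const u8) -> bool {
    ptr.align_offset(8) == 0
}

fn main() {
    let buf = [0u8; 16];
    for i in 0..8 {
        let p = buf.as_ptr().wrapping_add(i);
        // One direction always holds: offset 0 means already aligned.
        if aligned_by_align_offset(p) {
            assert!(aligned_by_modulo(p));
        }
    }
}
```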

And is casting to usize and taking the modulo enough as a way to check alignment before dereferencing, or not? (From what @RalfJung said, it might be enough but isn't promised.)

Yes it is, this follows directly from what alignment is. The address represented by the pointer must be divisible by the alignment without remainder. Is your question here how alignment is defined? That is something we could clarify in the Reference / Nomicon, I suppose.

However, that doesn't seem to be the problem; you seem to know alignment is about being divisible without a remainder. But then I am surprised by these questions. I'm afraid that if we start listing answers to questions like "after I checked x % 4 == 0, is x guaranteed to be divisible by 4", we'll never be done. I appreciate you being careful around unsafe, but I wonder where this distrust of even basic arithmetic behaving the only way it possibly could is coming from?

The only document that could answer such questions conclusively is a proper spec, at least a partial one, describing the behavior of these expressions. But then that would be really hard to read as well.

@gnzlbg

and it must point to memory containing a valid value of the type.

Correction: we don't require that for raw ptr derefs. Validity of values only comes up when "producing a value of some type" (what we call "typed copy" in the UCG). So you can do &raw const *ptr and no validity requirements come up.

Casting a pointer to a usize gives you the address of the pointer. Casting a usize to a pointer gives you a pointer with that address.

Agreed, though "address of" sounds like & to me so I usually avoid that wording here. "The address represented by that pointer" might be clearer.

Also note (this is mostly directed to anyone reading along), casting a ptr to usize and back does not give you the same pointer, it just gives you a pointer representing the same address! The provenance of the pointer might differ.


To go slightly meta, I think we discussed most of the points in this issue. What do we want to do before closing it? I feel this is as good an opportunity as any to start a "FAQ" document in this repo where we can collect answers to various specific questions that come up, and that do not fit the "discussion topic" scheme. So the issue here could be closed by adding answers to some of these questions to the FAQ (I am not sure if all of them are eligible, e.g. I don't think we should repeat the align_offset docs).

@elichai
Author

elichai commented Oct 16, 2019

Another question.
Is casting a pointer to i32/i64 (depending on the architecture) a valid way to get the address?
I see places in C where people cast pointers to long.
(btw, why isn't c_long defined as isize?)
If there's any good place to read up on this stuff, I'd love references :)

@bjorn3
Member

bjorn3 commented Oct 16, 2019

(btw, why isn't c_long defined as isize?)

Because the C standard doesn't guarantee that long is pointer-sized, only that it is at least as big as int.

@gnzlbg
Contributor

gnzlbg commented Oct 16, 2019

@elichai

Is casting a pointer to i32/i64 (depending on the architecture) a valid way to get the address?

Since that works in safe code on stable Rust, and safe Rust does not have undefined behavior, I'd say that it has to be.

@elichai
Author

elichai commented Oct 16, 2019

@gnzlbg I meant that this is for passing to FFI code that will cast it back to a pointer and dereference it (i.e. syscalls).

@gnzlbg
Contributor

gnzlbg commented Oct 16, 2019

In Rust you can cast a pointer to a pointer-sized int, and you can pass that int to FFI. It's up to the unknown code at the other side to make sure it only does correct things with that, but from Rust pov, you can cast pointers to int and back without problems.
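As a rough sketch of that pattern, the example below uses a Rust function with the C ABI as a hypothetical stand-in for the foreign side. Real foreign code would live outside the crate; `foreign_read_i32` and `read_through_ffi_style_call` are names invented here.

```rust
/// Hypothetical stand-in for foreign code: receives an address as a
/// pointer-sized integer, casts it back, and reads an i32.
///
/// # Safety
/// `addr` must come from a pointer that is valid for reading an `i32`.
unsafe extern "C" fn foreign_read_i32(addr: isize) -> i32 {
    unsafe { *(addr as *const i32) }
}

/// Rust side: turn the pointer into a pointer-sized integer and hand it over.
fn read_through_ffi_style_call(value: &i32) -> i32 {
    let addr = value as *const i32 as isize;
    // SAFETY: `addr` comes from a live reference to an i32.
    unsafe { foreign_read_i32(addr) }
}

fn main() {
    let x = 7;
    assert_eq!(read_through_ffi_style_call(&x), 7);
}
```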

@elichai
Author

elichai commented Oct 16, 2019

Weird to me that @bjorn3 is saying that long isn't guaranteed to be pointer sized, because musl and the kernel use it in the syscalls even for pointers:
https://github.com/ifduyue/musl/blob/master/arch/x86_64/syscall_arch.h

So I'm not really sure what to make of it. Should I use isize or c_long for registers in syscalls?

@gnzlbg
Contributor

gnzlbg commented Oct 16, 2019

Weird to me that @bjorn3 is saying that long isn't guaranteed to be pointer sized, because musl and the kernel use it in the syscalls even for pointers

@bjorn3 is probably referring to what the C standard guarantees, which is that long is at least as wide as int, but can be wider. The types that the C standard provides for holding a pointer value are uintptr_t and intptr_t.

A specific platform, like Linux + x86_64, can offer more guarantees. These guarantees just aren't necessarily portable to other platforms.

@comex

comex commented Oct 16, 2019

Indeed. The Linux kernel (and apparently musl as well) pervasively assumes that unsigned long is pointer-sized, which is fine for Linux, but it's not true on all platforms. On x86_64 Windows, long is 32-bit while pointers are 64-bit.

For syscall code which is both arch- and platform-specific, you can use whatever you want – isize, c_long, i64...

Edit: If you're wondering why Linux does that: uintptr_t comes from C99, which the Linux kernel predates by several years. C99 also originated the fixed-width aliases like uint32_t and uint64_t, which is why Linux uses its own aliases instead like u32 and u64.
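The size relationships from the last few comments can be inspected from Rust. This is a sketch only: the helper name is made up, and the first assertion reflects how isize behaves on current targets rather than a statement about every possible platform.

```rust
use std::mem::size_of;
use std::os::raw::{c_int, c_long};

/// True if `c_long` happens to be pointer-sized on the current target.
fn c_long_is_pointer_sized() -> bool {
    size_of::<c_long>() == size_of::<*const ()>()
}

fn main() {
    // isize is the pointer-sized integer type on current Rust targets.
    assert_eq!(size_of::<isize>(), size_of::<*const ()>());
    // C only promises that long is at least as wide as int.
    assert!(size_of::<c_long>() >= size_of::<c_int>());
    // Expected to differ by platform, so print instead of asserting.
    println!("c_long is pointer-sized here: {}", c_long_is_pointer_sized());
}
```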

@elichai
Author

elichai commented Oct 16, 2019

Thanks. That clarifies it for me.
I didn't think about the fact that the OS, and not just the hardware, is part of the reason for integer widths.

@RalfJung RalfJung added C-support Category: Supporting a user to solve a concrete problem and removed C-open-question Category: An open question that we should revisit labels Nov 3, 2019
@RalfJung RalfJung changed the title Imporving what is UB and what is defined in pointers Some question around what is UB and what is defined when using raw pointers Nov 3, 2019
@RalfJung RalfJung changed the title Some question around what is UB and what is defined when using raw pointers Some questions around what is UB and what is defined when using raw pointers Nov 3, 2019
@RalfJung
Member

RalfJung commented Aug 8, 2020

I think the questions here have been answered.

@RalfJung RalfJung closed this as completed Aug 8, 2020

5 participants