Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Implement stable symbol-name generation algorithm. #31539

Closed

Conversation

michaelwoerister
Copy link
Member

This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689.

Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though.

I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently.

@nikomatsakis: It would be nice if there was a way of finding out if a DefPath refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessing DefPaths without allocating a Vec for them.

TODO

  • Actually "crate qualify" symbols, as promised in the docs.
  • Add a test case showing that symbol names are deterministic.
  • Maybe add a test case showing that symbol names are stable against small code changes.

One thing that might be interesting to the @rust-lang/compiler team:
I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations and the rest of the code used a truncated version on SHA256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?
==> Just switched things back to SHA-2 for now.

@rust-highfive
Copy link
Collaborator

r? @Aatch

(rust_highfive has picked a reviewer for you, use r? to override)

@bluss
Copy link
Member

bluss commented Feb 10, 2016

SipHash it is an algorithm designed by cryptographers, but it's not in the same category of hash functions as SHA-2 which is a cryptographic hash function.

For example, the SipHash paper says

In general, building a MAC from a general-
purpose cryptographic hash function appears to be a highly suboptimal ap-
proach: general-purpose cryptographic hash functions perform many extra com-
putations for the goal of collision resistance on public inputs, while MACs have
secret keys and do not need collision resistance.

and

We comment that SipHash is not meant to be, and (obviously) is not, collision-
resistant.

This use case does not use a secret key with SipHash (it always uses the default key), and it wants to have collision resistance on public inputs. Yes, in practice, we can find collisions for either of them if the hash is only 64 bits long, but only SHA-2 is designed for this case.

@michaelwoerister
Copy link
Member Author

@bluss Those are good references. It really sounds like SipHash is not a good candidate for this use case. I'll switch things to SHA-2.

@bluss
Copy link
Member

bluss commented Feb 11, 2016

What's the impact of a collision?

@bors
Copy link
Contributor

bors commented Feb 11, 2016

☔ The latest upstream changes (presumably #31487) made this pull request unmergeable. Please resolve the merge conflicts.

@michaelwoerister
Copy link
Member Author

What's the impact of a collision?

Symbol conflicts: Two functions with different signatures could end up with the same name. If it's in the same crate, compilation will probably just fail. If it's in different crates, either linking fails due to a conflict or the wrong symbol might be linked, resulting most probably in untraceable, subtle memory corruption.

@nikomatsakis
Copy link
Contributor

Nice.

@nikomatsakis
Copy link
Contributor

cc @alexcrichton @brson

@alexcrichton
Copy link
Member

This looks fantastic to me, nice work @michaelwoerister!

I wouldn't have too many opinions about SHA2 vs SipHash, I suspect either choice will be suitable for our purposes. I'd personally love to jettison the SHA2 implementation in the compiler, and I highly doubt we can find a SipHash practical collision with the same path, different metadata, and yet something like different type substitutions. We should also be able to change this if we ever need to really as we can only guarantee no recompiles within one version of Rust anyway.

@michaelwoerister
Copy link
Member Author

Thanks for the comment @alexcrichton. It's true, we have been using SipHash for function monomorphization for quite a while now and I've never seen a problem with that.

My assumption about the requirements for this hash function (and later also one used for fingerprinting incremental compilation artifacts) is that it should provide good distribution so that collision probability is as low as it can theoretically be for a given number of hash value bits, BUT that we don't care about "security", i.e. whether it's easy to find hash collisions on purpose. Or is there some kind of "attack scenario" that I'm overlooking?

For maximum performance (and if the above assumptions hold), we could also use something like CRC64.
Anyway, I think for this PR, I'll just leave SHA-2 in.

@bluss
Copy link
Member

bluss commented Feb 12, 2016

Here's the issue about collisions in type_id, but that's only tangential to this: #10389. As long as we go into this with the knowledge that SipHash is just a regular hash function, I'm fine.

@michaelwoerister
Copy link
Member Author

@bluss Thanks for the link!

@michaelwoerister
Copy link
Member Author

As an aside, I think "-C metadata" is a bit of a misnomer. Something like "-C salt" or "-C disambiguator" would be clearer (although the later one definitely is too long).

@michaelwoerister
Copy link
Member Author

Another side note: Now that we get predictable symbol names for monomorphizations, we might want to start sharing them cross-crate.

@alexcrichton
Copy link
Member

Ah as another data point in Cargo we extensively use SipHash for fingerprinting crates and packages. I haven't at least seen anyone run into any problems with that.

I also agree that at this point -C metadata is basically just wrong, but I don't think we can rename due to the extensive usage in Cargo right now :(

@bors
Copy link
Contributor

bors commented Feb 13, 2016

☔ The latest upstream changes (presumably #31524) made this pull request unmergeable. Please resolve the merge conflicts.

@arielb1
Copy link
Contributor

arielb1 commented Feb 13, 2016

What's the problem with the compiler SHA2 implementation? Making a collision with SipHash with a known key only requires some modifiable string on both ends - e.g. an item name, or comment.

@michaelwoerister
Copy link
Member Author

@arielb1 I don't think anything is wrong with the SHA-2 implementation per se. It's just incomplete (only supports SHA256) and it's one more piece of code that needs to be maintained. I'd guess that second point is the main motivation for wanting to get rid of it.

@michaelwoerister
Copy link
Member Author

So, here's something interesting: Some of the test cases (e.g. run-pass/typeid-intrinsic.rs) link to an external crate that has exactly the same name, and thus the crate graph ends up with two identical crates in it. So far, this was no problem because the two crates still have a different SVH but under the new scheme, two crates are considered the same if they have the same name/metadata pair.

The tests now fail with 'duplicate symbol' linker errors, but I guess the sensible thing to do here, is to emit a compilation error when the crate loader discovers two crates with the same name/metadata but different SVHs.

@michaelwoerister
Copy link
Member Author

Up until that last commit this passes make check for me locally, with the last commit it looks good too, except that the run-pass/backtrace.rs test case fails (it seems to panic while trying to write the backtrace). Does anybody have an idea what that could be about? In general, creating backtraces seems to work just fine...

@michaelwoerister
Copy link
Member Author

That's the error in question: https://travis-ci.org/rust-lang/rust/builds/109449298#L6591

@alexcrichton
Copy link
Member

Oh you should just be able to update the test to check for backtrace::foo instead of only foo, I believe that previously the crate name was never on symbols, just in the debug description of symbols.

@michaelwoerister
Copy link
Member Author

@alexcrichton I'm afraid it's something more complicated than that. If you look at the actual backtrace that the subprocess provides (see below), it looks like there is a panic while creating the backtrace (in std::sys::backtrace::tracing::imp::write in gcc_s.rs). Something causes an error there.

error: test run failed!
status: exit code: 101
command: x86_64-unknown-linux-gnu/test/run-pass/backtrace.stage2-x86_64-unknown-linux-gnu 
stdout:
------------------------------------------

------------------------------------------
stderr:
------------------------------------------
thread '<main>' panicked at 'bad output: thread '<main>' panicked at 'explicit panic', /home/travis/build/rust-lang/rust/src/test/run-pass/backtrace.rs:23
stack backtrace:
   1:     0x2b4313b5fec0 - std::sys::backtrace::tracing::imp::write::h5cb8831505517f1a
   2:     0x2b4313b68e3b - std::panicking::default_handler::_$u7b$$u7b$closure$u7d$$u7d$::h7d4838ac3001375c
   3:     0x2b4313b68a08 - std::panicking::default_handler::heb8f66506b25fc53
   4:     0x2b4313b31ac6 - std::sys_common::unwind::begin_unwind_inner::h896acc46de4d4a1c
   5:     0x2b4313671e44 - std::sys_common::unwind::begin_unwind::h72328ba1025302f0
                        at src/libstd/sys/common/unwind/mod.rs:236
   6:     0x2b4313671dd3 - backtrace::foo::h2bfe25e162df4c7c
                        at /home/travis/build/rust-lang/rust/<std macros>:3
   7:     0x2b43136732a5 - backtrace::main::h1e5dc0fe5e6f026b
                        at src/test/run-pass/backtrace.rs:98
   8:     0x2b4313b68474 - std::sys_common::unwind::try::try_fn::hbb3b06ed012dffa6
   9:     0x2b4313b5d82b - __rust_try
  10:     0x2b4313b680b1 - std::rt::lang_start::ha4a60ae3cf761e9f
  11:     0x2b431424eec4 - __libc_start_main
  12:     0x2b4313671c28 - <unknown>
  13:                0x0 - <unknown>
', /home/travis/build/rust-lang/rust/src/test/run-pass/backtrace.rs:54
note: Run with `RUST_BACKTRACE=1` for a backtrace.

------------------------------------------

@alexcrichton
Copy link
Member

Hm, that looks normal to me though? It looks like the assertion message originates from here which was testing a panic originating here. The bad output happened because it expected to find - foo where it instead printed - backtrace::foo.

In other words we should probably really look into skipping the top few frames when printing a backtrace, but other than that it should be good!

@michaelwoerister
Copy link
Member Author

Yes, you're right of course :)
The test passes just fine now.

…s with the same crate-name and crate-salt but different SVHs.
…runtime.

We want to prevent compiling something against one version
of a dynamic library and then, at runtime accidentally
using a different version of the dynamic library. With the
old symbol-naming scheme this could not happen because every
symbol had the SVH in it and you'd get an error by the
dynamic linker when using the wrong version of a dylib. With
the new naming scheme this isn't the case any more, so this
patch adds the "link-guard" to prevent this error case.

This is implemented as follows:

- In every crate that we compile, we emit a function called
  "__rustc_link_guard_<crate-name>_<crate-svh>"
- The body of this function contains calls to the
  "__rustc_link_guard" functions of all dependencies.
- An executable contains a call to it's own
  "__rustc_link_guard" function.

As a consequence the "__rustc_link_guard" function call graph
mirrors the crate graph and the dynamic linker will fail if a
wrong dylib is loaded somewhere because its
"__rustc_link_guard" function will contain a different SVH in
its name.
@michaelwoerister
Copy link
Member Author

@bors r=nikomatsakis (after rebase)

@bors
Copy link
Contributor

bors commented Mar 15, 2016

📌 Commit f6b0f17 has been approved by nikomatsakis

@bors
Copy link
Contributor

bors commented Mar 15, 2016

⌛ Testing commit f6b0f17 with merge 8c4d880...

bors added a commit that referenced this pull request Mar 15, 2016
WIP: Implement stable symbol-name generation algorithm.

This PR changes the way symbol names are generated by the compiler. The new algorithm reflects the current state of the discussion over at rust-lang/rfcs#689.

Once it is done, it will also fix issue #30330. I want to add a test case for that before closing it though.

I also want to do some performance tests. The new algorithm does a little more work than the previous one due to various reasons, and it might make sense to adapt it in a way that allows it to be implemented more efficiently.

@nikomatsakis: It would be nice if there was a way of finding out if a `DefPath` refers to something in the current crate or in an external one. The information is already there, it's just not accessible at the moment. I'll probably propose some minor changes there, together with some facilities to allow for accessing `DefPaths` without allocating a `Vec` for them.

**TODO**
 - ~~Actually "crate qualify" symbols, as promised in the docs.~~
 - ~~Add a test case showing that symbol names are deterministic~~.
 - Maybe add a test case showing that symbol names are stable against small code changes.

~~One thing that might be interesting to the @rust-lang/compiler team:
I've used SipHash exclusively now for generating symbol hashes. Previously it was only used for monomorphizations and the rest of the code used a truncated version on SHA256. Is there any benefit to sticking to SHA? I don't really see one since we only used 64 bits of the digest anyway, but maybe I'm missing something?~~ ==> Just switched things back to SHA-2 for now.
@bors
Copy link
Contributor

bors commented Mar 15, 2016

💔 Test failed - auto-mac-64-nopt-t

@nikomatsakis
Copy link
Contributor

@nikomatsakis
Copy link
Contributor

@michaelwoerister and I chatted a bit over e-mail. I'm going to close this PR and open one of my own that includes his commits plus a few more.

bors added a commit that referenced this pull request Mar 23, 2016
Revamp symbol names for impls (and make them deterministic, etc)

This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate.

There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =)

r? @eddyb
cc @eddyb @alexcrichton @michaelwoerister
bors added a commit that referenced this pull request Mar 25, 2016
…xcrichton

Revamp symbol names for impls (and make them deterministic, etc)

This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate.

There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =)

r? @eddyb
cc @eddyb @alexcrichton @michaelwoerister
bors added a commit that referenced this pull request Mar 26, 2016
…xcrichton

Revamp symbol names for impls (and make them deterministic, etc)

This builds on @michaelwoerister's epic PR #31539 (note that his PR never landed, so I just incorporated it into this one). The main change here is that we remove the "name" from `DefPathData` for impls, since that name is synthetic and not sufficiently predictable for incr comp. However, just doing that would cause bad symbol names since those are based on the `DefPath`. Therefore, I introduce a new mechanism for getting symbol names (and also paths for user display) called `item_path`. This is kind of simplistic for now (based on strings) but I expect to expand it later to support richer types, hopefully generating C++-mangled names that gdb etc can understand. Along the way I cleaned up how we track the path that leads to an extern crate.

There is still some cleanup left undone here. Notably, I didn't remove the impl names altogether -- that would probably make sense. I also didn't try to remove the `item_symbols` vector. Mostly I want to unblock my other incr. comp. work. =)

r? @eddyb
cc @eddyb @alexcrichton @michaelwoerister
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants