Skip to content

Latest commit

 

History

History
1179 lines (934 loc) · 46.9 KB

2603-rust-symbol-name-mangling-v0.md

File metadata and controls

1179 lines (934 loc) · 46.9 KB

Summary

This RFC proposes a new mangling scheme that describes what the symbol names generated by the Rust compiler look like. This new scheme has a number of advantages over the existing one which has grown over time without a clear direction. The new scheme is consistent, depends less on compiler internals, and the information it stores in symbol names can be decoded again which provides an improved experience for users of external tools that work with Rust symbol names.

Note that, at this point, the new mangling scheme would not be part of the language specification or the specification of a stable Rust ABI. In the future it could be part of both and it is designed to be stable and extensible; but for the time being it would still be an implementation detail of the Rust compiler.

Motivation

Due to its ad-hoc nature, the compiler's current name mangling scheme has a number of drawbacks:

  • Information about generic parameters and other things is lost in the mangling process. One cannot extract the type arguments of a monomorphized function from its symbol name.

  • The current scheme is inconsistent: most paths use Itanium ABI style encoding, but some don't.

  • The symbol names it generates can contain . characters which is not generally supported on all platforms. [1] [2] [3]

  • It depends on compiler internals and its results cannot be replicated by another compiler implementation or external tool.

The proposed scheme solves these problems:

  • It encodes information about generic parameters in a reversible way.
  • It has a consistent definition that does not rely on pretty-printing certain language constructs.
  • It generates symbols that only consist of the characters A-Z, a-z, 0-9, and _.
  • While the proposed scheme still contains things that are implementation defined it has a clearer path towards full name predictability in future.

These properties should make it easier for third party tools to work with Rust binaries.

Guide-level explanation

The following section will lay out the requirements for a name mangling scheme and then introduce the actual scheme through a series of ever more complex examples.

Requirements for a Symbol Mangling Scheme

A symbol mangling scheme has a few goals, one of them essential, the rest of them desirable. The essential one is:

  • The scheme must provide an unambiguous string encoding for everything that can end up in a binary's symbol table.

"Unambiguous" means that no two distinct compiler-generated entities (that is, mostly object code for functions) must be mapped to the same symbol name. This disambiguation is the main purpose of the hash-suffix in the current, legacy mangling scheme. The scheme proposed here, on the other hand, achieves it in a way that allows to also satisfy a number of additional desirable properties of a mangling scheme:

  • A mangled symbol should be decodable to some degree. That is, it is desirable to be able to tell which exact concrete instance of e.g. a polymorphic function a given symbol identifies. This is true for external tools, backtraces, or just people only having the binary representation of some piece of code available to them. With the current scheme, this kind of information gets lost in the magical hash-suffix.

  • A mangling scheme should be platform-independent. This is mainly achieved by restricting the character set to A-Z, a-z, 0-9, _. All other characters might have special meaning in some context (e.g. . for MSVC DEF files) or are simply not supported (e.g. Unicode).

  • The scheme should be efficient, meaning that the symbols it produces are not unnecessarily long (because that takes up space in object files and means more work for the compiler and the linker). In addition, generating or demangling a symbol name should not be too computationally expensive.

  • When used as part of a stable ABI, it should be possible to predict the symbol name for a given source-level construct. For example, given the definition fn foo<T>() { ... }, the scheme should allow to construct, by hand, the symbol names for e.g. foo<u32> or foo<extern fn(i32, &mut SomeStruct<(char, &str)>, ...) -> !>(). Since the current scheme generates its hash from the values of various compiler internal data structures, an alternative compiler implementation could not predict the symbol name, even for simple cases. Note that the scheme proposed here does not fulfill this requirement either (yet) as some things are still left to the compiler implementation.

The RFC also has a couple of non-goals:

  • The mangling scheme does not try to be compatible with an existing (e.g. C++) mangling scheme. While it might sound tempting to encode Rust symbols with an existing scheme, it is the author's opinion that the actual benefits are small (C++ tools would not demangle to Rust syntax, demanglings would be hard to read) and at the same time supporting a Rust-specific scheme in existing tools seems quite feasible (many tools like GDB, LLDB, binutils, and valgrind already have specialized code paths for Rust symbols).

  • The RFC does not try to define a standardized demangled form for symbol names. It defines the mangled form and makes sure it can be demangled in an efficient manner but different demanglers still have some degree of freedom regarding how symbol names are presented to the user.

The Mangling Scheme by Example

This section will develop an overview of the mangling scheme by walking through a number of examples. We'll start with the simplest case -- and will see how that already involves things that might be surprising.

Free-standing Functions and Statics

A free-standing function is fully identified via its absolute path. For example, the following function

mod foo {
  fn bar() {}
}

has the path foo::bar and NN3foo3bar is a possible mangling of that path that complies to the character set we are restricted to. Why this format with numbers embedded in it? It is a run-length encoding, similar to what the Itanium C++ ABI name mangling scheme uses for identifiers. The scheme proposed here will also use this format because it does not need termination tokens for identifiers (which are hard to come by with our limited character set).

Note that each component in the path (i.e. foo and bar) also has an accompanying start-tag (here N) at the beginning. This start-tag is needed in order for the syntax to be able to represent complex, nested structures as we will see later.

The symbol name above, unfortunately, does not unambiguously identify the function in every context. It is perfectly valid for another crate to also define mod foo { fn bar() {} } somewhere. So in order to avoid conflicts in such cases, the absolute path must always include the crate-id, as in NNC7mycrate3foo3bar. The crate-id has a C start-tag.

There is another possible ambiguity that we have to take care of. Rust has two distinct namespaces: the type and the value namespace. This leads to a path of the form crate_id::foo::bar not uniquely identifying the item bar because the following snippet is legal Rust code:

fn foo() {
    fn bar() {}
}

mod foo {
    fn bar() {}
}

The function foo lives in the value namespace while the module foo lives in the type namespace. They don't interfere. In order to make the symbol names for the two distinct bar functions unique, we thus add a namespace identifier to the start-tag of components where necessary, as in NvNvC7mycrate3foo3bar for the first case and NvNtC7mycrate3foo3bar second case (notice the difference: NvNv... vs NvNt...).

There is one final case of name ambiguity that we have to take care of. Because of macro hygiene, multiple items with the same name can appear in the same context. The compiler internally disambiguates such names by augmenting them with a numeric index. For example, the first occurrence of the name foo within its parent is actually treated as foo'0, the second occurrence would be foo'1, the next foo'2, and so one. The mangling scheme will adopt this setup by prepending a disambiguation prefix to each identifier with a non-zero index. So if macro expansion would result in the following code:

mod foo {
    fn bar'0() {}
    // The second `bar` function was introduced by macro expansion.
    fn bar'1() {}
}

then we would encode the two functions symbols as NvNtC7mycrate3foo3bar and NvNtC7mycrate3foos_3bar respectively (note the s_ prefix in the second case). A very similar disambiguation is needed for avoiding conflicts between crates of the same name but different versions. The same syntactic prefix is thus used for crate-id where we encode the crate disambiguator as in NtNvCs1234_7mycrate3foo3bar. Details on the shape of this prefix are provided in the reference-level description.

As opposed to C++ and other languages that support function overloading, we don't need to include function parameter types in the symbol name. Rust does not allow two functions of the same name but different arguments.

The final symbol name for the function would also include the prefix _R that is common to all symbol names generated by this scheme:

  _RNvNtCs1234_7mycrate3foo3bar
  <>^^^^^<----><------><--><-->
   ||||||   |      |     |   |
   ||||||   |      |     |   +--- "bar" identifier
   ||||||   |      |     +------- "foo" identifier
   ||||||   |      +------------- "mycrate" identifier
   ||||||   +-------------------- disambiguator for "mycrate"
   |||||+------------------------ start-tag for "mycrate"
   ||||+------------------------- namespace tag for "foo"
   |||+-------------------------- start-tag for "foo"
   ||+--------------------------- namespace tag for "bar"
   |+---------------------------- start-tag for "bar"
   +----------------------------- common Rust symbol prefix

Generic Functions

Each monomorphization of a generic function has its own symbol name. The monomorphizations are disambiguated by the list of concrete generic arguments. These arguments are added to the symbol name by a pair of I start-tag at the beginning and a list of the actual arguments at the end. So the instance

std::mem::align_of::<f64>

would be mangled to

_RINvNtC3std3mem8align_ofdE
  ^                      ^^
  |                      ||
  |                      |+--- end of argument list
  |                      +--- f64
  +--- start-tag

where I precedes the thing the arguments belong to, d designates f64 and E ends the argument list. As we can see, we need to be able to represent all kinds of types that can be part of such an argument list. (In the future, when const generics get added to the language, we will also need to represent values) These kinds of types are:

  • basic types (char, (), str, !, i8, i16, ...)
  • reference and pointers types, shared, mut and const
  • tuples
  • arrays, with and without fixed size (e.g. [u8], [u8; 17])
  • structs, enums, closures, and other named types, possibly with their own set of type arguments
  • function types such as fn(&i32) -> u16
  • dyn traits

Basic types are all encoded via a single lower-case letter, like in the Itanium scheme. Named types are encoded as their absolute path (including arguments) like is done for function symbols. Composites like references, tuples, and function types all follow a simple grammar given in the reference-level explanation below. Here are some example manglings to get a general feel of what they look like:

  • std::mem::align_of::<usize>: _RINvNtC3std3mem8align_ofjE
  • std::mem::align_of::<&char>: _RINvNtC3std3mem8align_ofRcE
  • std::mem::align_of::<std::mem::Discriminant>: _RINvNtC3std3mem8align_ofNtNtC3std3mem12DiscriminantE
  • std::mem::align_of::<&mut (&str,())>: _RINvNtC3std3mem8align_ofQTReuEE

There's one more thing we have to take into account for generic functions: The compiler may produce "crate-local" copies of a monomorphization. That is, if there is a function foo<T> which gets used as foo<u32> in two different crates, the compiler (depending on the optimization level) might generate two distinct functions at the LLVM IR level, each with it's own symbol. In order to support this without running into conflicts, symbol names for monomorphizations must include the id of the crate they are instantiated for. This scheme does this by appending an <crate-id> suffix to the symbol. So for example the mangling for std::mem::align_of::<usize> would actually look like this:

_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo")
_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar")

Closures and Closure Environments

The scheme needs to be able to generate symbol names for the function containing the code of a closure and it needs to be able to refer to the type of a closure if it occurs as a type argument. As closures don't have a name, we need to generate one. The scheme proposes to use the namespace and disambiguation mechanisms already introduced above for this purpose. Closures get their own "namespace" (i.e. they are neither in the type nor the value namespace), and each closure has an empty name with a disambiguation index (like for macro hygiene) identifying them within their parent. The full name of a closure is then constructed like for any other named item:

mod foo {
  fn bar(x: u32) {
    let a = |x| { x + 1 }; // local name: NC<...>0
    let b = |x| { x + 2 }; // local name: NC<...>s_0

    a(b(x))
  }
}

In the above example we have two closures, the one assigned to a and the one assigned to b. The first one would get the local name NC<...>0 and the second one the name NC<...>s_0. The 0 signifies the length of their (empty) name. The <...> part is the path of the parent. The C is the namespace tag, analogous to the v tag for the value namespace. The s_ for the second closure is the disambiguation index (index 0 is, again, encoded by not prepending a prefix). Their full names would then be NCNvNtC7mycrate3foo3bar0 and NCNvNtC7mycrate3foo3bars_0 respectively.

Methods

Methods are nested within impl or trait items. As such it would be possible to construct their symbol names as paths like my_crate::foo::{{impl}}::some_method where {{impl}} somehow identifies the impl in question. Since impls don't have names, we'd have to use an indexing scheme like the one used for closures (and indeed, this is what the compiler does internally). Adding in generic arguments to, this would lead to symbol names looking like my_crate::foo::impl'17::<u32, char>::some_method.

However, in the opinion of the author these symbols are very hard to map back to the method they represent. Consider a module containing dozens of types, each with multiple impl blocks generated via #[derive(...)]. In order to find out which method a symbol maps to, one would have to count the number of handwritten and macro generated impl blocks in the module, and hope that one correctly guessed the number of impl blocks introduced by the given derive-macro (each macro invocation can introduce 0..n such blocks). The name of the method might give a hint, but there are still likely to be dozens of methods named clone, hash, eq, et cetera.

The RFC therefore proposes to keep symbol names close to how methods are represented in error messages, that is:

  • <Foo<u32, char>>::some_method for inherent methods, and
  • <Foo<u32, char> as SomeTrait<Quux>>::some_method for trait methods.

This can be achieved by extending the definition of paths that we have used so far. Instead of the path root always being a crate-id, we now also allow a path to start with an impl production that contains the self-type and (for trait methods) the name of the trait being implemented.

Thus, this extended form of paths would have the following syntax:

<path> = C <identifier>                      // crate-id root
       | M <type>                            // inherent impl root
       | X <type> <path>                     // trait impl root
       | N <namespace> <path> <identifier>   // nested path
       | I <path> {<generic-arg>} E          // generic arguments

Here are some examples for complete symbol names:

<mycrate::Foo<u32>>::foo => _RNvMINtC7mycrate3FoomE3foo
<u32 as mycrate::Foo>::foo => _RNvXmNtC7mycrate3Foo3foo
<mycrate::Foo<u32> as mycrate::Bar<u64>>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo

Items Within Generic Impls

In Rust one can define items within generic items, e.g. functions or impls, like in the following example:

struct Foo<T>(T);

impl<T> From<T> for Foo<T> {
  fn from(x: T) -> Self {
    static MSG: &str = "...";
    panic!("{}", MSG)
  }
}

The MSG here (or any other such nested definition) does not inherit the generic context from the impl. MSG is non-generic, and a function defined in its place would be too. The fully qualified name of MSG, according to our examples so far, is thus <mycrate::Foo<_> as std::convert::From<_>>::from::MSG and its symbol name:

_RNvNvXINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG

However, with trait specialization, this symbol can be ambiguous. Consider the following piece of code:

struct Foo<T>(T);

impl<T> From<T> for Foo<T> {
  default fn from(x: T) -> Self {
    static MSG: &str = "...";
    panic!("{}", MSG)
  }
}

impl<T: Default> From<T> for Foo<T> {
  fn from(x: T) -> Self {
    static MSG: &str = "123";
    panic!("{}", MSG)
  }
}

Notice that both MSG statics have the path <Foo<_> as From<_>>::foo::MSG. We somehow have to disambiguate the impls. We do so by adding the path of the impl to the symbol name.

<path> = C <identifier>                      // crate-id root
       | M <impl-path> <type>                // inherent impl root
       | X <impl-path> <type> <path>         // trait impl root
       | N <namespace> <path> <identifier>   // nested path
       | I <path> {<generic-arg>} E          // generic arguments

<impl-path> = [<disambiguator>] <path>

The two symbol names would then look something like:

_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG
       <----------><----------------><----------------------->
        impl-path      self-type            trait-name

Like other disambiguation information, this path would usually not actually be shown by demanglers.

Unicode Identifiers

Rust allows Unicode identifiers but our character set is restricted to ASCII alphanumerics, and _. In order to transcode the former to the latter, we use the same approach as Swift, which is: encode all non-ASCII identifiers via Punycode, a standardized and efficient encoding that keeps encoded strings in a rather human-readable format. So for example, the string

"Gödel, Escher, Bach"

is encoded as

"Gdel, Escher, Bach-d3b"

which, as opposed to something like Base64, still gives a pretty good idea of what the original string looked like.

Each component of a name, i.e. anything that starts with the number of bytes to read in the examples above, is encoded individually. Components encoded this way are augmented with a u prefix so that demanglers know that the identifier needs further decoding. As an example, the function:

mod gödel {
  mod escher {
    fn bach() {}
  }
}

would be mangled as:

_RNvNtNtC7mycrateu8gdel_5qa6escher4bach
                 <-------->
              Unicode component

Compression/Substitution

The length of symbol names has an influence on how much work the compiler, linker, and loader have to perform. The shorter the names, the better. At the same time, Rust's generics can lead to rather long names (which are often not visible in the code because of type inference and impl Trait). For example, the return type of the following function:

fn quux(s: Vec<u32>) -> impl Iterator<Item = (u32, usize)> {
    s.into_iter()
     .map(|x| x + 1)
     .filter(|&x| x > 10)
     .zip(0..)
     .chain(iter::once((0, 0)))
}

is

std::iter::Chain<
  std::iter::Zip<
    std::iter::Filter<
      std::iter::Map<
        std::vec::IntoIter<u32>,
        [closure@src/main.rs:16:11: 16:18]>,
      [closure@src/main.rs:17:14: 17:25]>,
    std::ops::RangeFrom<usize>>,
  std::iter::Once<(u32, usize)>>

It would make for a long symbol name if this type is used (maybe repeatedly) as a generic argument somewhere. C++ has the same problem with its templates; which is why the Itanium mangling introduces the concept of compression. If a component of a definition occurs more than once, it will not be repeated and instead be emitted as a substitution marker that allows to reconstruct which component it refers to. The scheme proposed here will use the same approach (but with a simpler definition).

The exact scheme will be described in detail in the reference level explanation below but it roughly works as follows: As a mangled symbol name is being built, we remember the position of every substitutable item in the output string, that is, we keep track of things a subsequent occurrence of which could be replaced by a back reference.

The things that are eligible for substitution are (1) all prefixes of paths (including the entire path itself), (2) all types except for basic types, and (3) type-level constants (array lengths and values passed to const generic params).

Here's an example in order to illustrate the concept. The name

std::iter::Chain<std::iter::Zip<std::vec::IntoIter<u32>, std::vec::IntoIter<u32>>>

is mangled to the following uncompressed string. The lines below show parts of the mangled string that already occurred before and can thus be replaced by a back reference. The number of at the beginning of each span given the 0-based byte position of where it occurred the first time.

  0         10        20        30        40        50        60        70        80        90
_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE
                            5----              5----                    5----
                          3-----------                                43---------
                                                                    41--------------------
                                                                   40-----------------------

The compiler is always supposed to use the longest replacement possible in order to achieve the best compression. The compressed symbol looks as follows:

_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE
                          ^^^         ^^^               ^^^     back references

Back references have the form B<base-62-number>_.

Reference-level explanation

The reference-level explanation consists of three parts:

  1. A specification of the syntax mangled names conform to.
  2. A specification of the compression scheme.
  3. A mapping of Rust entities to the mangling syntax.

For implementing a demangler, only the first two sections are of interest, that is, a demangler only needs to understand syntax and compression of names, but it does not have to care about how the compiler generates mangled names.

Syntax Of Mangled Names

The syntax of mangled names is given in extended Backus-Naur form:

  • Non-terminals are within angle brackets (as in <path>)
  • Terminals are within quotes (as in "_R"),
  • Optional parts are in brackets (as in [<disambiguator>]),
  • Repetition (zero or more times) is signified by curly braces (as in {<type>})
  • Comments are marked with //.

Mangled names conform to the following grammar:

// The <decimal-number> specifies the encoding version.
<symbol-name> =
    "_R" [<decimal-number>] <path> [<instantiating-crate>] [<vendor-specific-suffix>]

<path> = "C" <identifier>                    // crate root
       | "M" <impl-path> <type>              // <T> (inherent impl)
       | "X" <impl-path> <type> <path>       // <T as Trait> (trait impl)
       | "Y" <type> <path>                   // <T as Trait> (trait definition)
       | "N" <namespace> <path> <identifier> // ...::ident (nested path)
       | "I" <path> {<generic-arg>} "E"      // ...<T, U> (generic args)
       | <backref>

// Path to an impl (without the Self type or the trait).
// The <path> is the parent, while the <disambiguator> distinguishes
// between impls in that same parent (e.g. multiple impls in a mod).
// This exists as a simple way of ensure uniqueness, and demanglers
// don't need to show it (unless the location of the impl is desired).
<impl-path> = [<disambiguator>] <path>

// The <decimal-number> is the length of the identifier in bytes.
// <bytes> is the identifier itself, and it's optionally preceded by "_",
// to separate it from its length - this "_" is mandatory if the <bytes>
// starts with a decimal digit, or "_", in order to keep it unambiguous.
// If the "u" is present then <bytes> is Punycode-encoded.
<identifier> = [<disambiguator>] <undisambiguated-identifier>
<disambiguator> = "s" <base-62-number>
<undisambiguated-identifier> = ["u"] <decimal-number> ["_"] <bytes>

// Namespace of the identifier in a (nested) path.
// It's an a-zA-Z character, with a-z reserved for implementation-internal
// disambiguation categories (and demanglers should never show them), while
// A-Z are used for special namespaces (e.g. closures), which the demangler
// can show in a special way (e.g. `NC...` as `...::{closure}`), or just
// default to showing the uppercase character.
<namespace> = "C"      // closure
            | "S"      // shim
            | <A-Z>    // other special namespaces
            | <a-z>    // internal namespaces

<generic-arg> = <lifetime>
              | <type>
              | "K" <const> // forward-compat for const generics

// An anonymous (numbered) lifetime, either erased or higher-ranked.
// Index 0 is always erased (can show as '_, if at all), while indices
// starting from 1 refer (as de Bruijn indices) to a higher-ranked
// lifetime bound by one of the enclosing <binder>s.
<lifetime> = "L" <base-62-number>

// Specify the number of higher-ranked (for<...>) lifetimes to bound.
// <lifetime> can then later refer to them, with lowest indices for
// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`,
// any <lifetime>s in ... (but not inside more binders) will observe
// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively.
// The number of bound lifetimes is value of <base-62-number> + 1.
<binder> = "G" <base-62-number>

<type> = <basic-type>
       | <path>                      // named type
       | "A" <type> <const>          // [T; N]
       | "S" <type>                  // [T]
       | "T" {<type>} "E"            // (T1, T2, T3, ...)
       | "R" [<lifetime>] <type>     // &T
       | "Q" [<lifetime>] <type>     // &mut T
       | "P" <type>                  // *const T
       | "O" <type>                  // *mut T
       | "F" <fn-sig>                // fn(...) -> ...
       | "D" <dyn-bounds> <lifetime> // dyn Trait<Assoc = X> + Send + 'a
       | <backref>

<basic-type> = "a"      // i8
             | "b"      // bool
             | "c"      // char
             | "d"      // f64
             | "e"      // str
             | "f"      // f32
             | "h"      // u8
             | "i"      // isize
             | "j"      // usize
             | "l"      // i32
             | "m"      // u32
             | "n"      // i128
             | "o"      // u128
             | "s"      // i16
             | "t"      // u16
             | "u"      // ()
             | "v"      // ...
             | "x"      // i64
             | "y"      // u64
             | "z"      // !
             | "p"      // placeholder (e.g. for generic params), shown as _

// If the "U" is present then the function is `unsafe`.
// The return type is always present, but demanglers can
// choose to omit the ` -> ()` by special-casing "u".
<fn-sig> = [<binder>] ["U"] ["K" <abi>] {<type>} "E" <type>

<abi> = "C"
      | <undisambiguated-identifier>

<dyn-bounds> = [<binder>] {<dyn-trait>} "E"
<dyn-trait> = <path> {<dyn-trait-assoc-binding>}
<dyn-trait-assoc-binding> = "p" <undisambiguated-identifier> <type>
<const> = <type> <const-data>
        | "p" // placeholder, shown as _
        | <backref>

// The encoding of a constant depends on its type. Integers use their value,
// in base 16 (0-9a-f), not their memory representation. Negative integer
// values are preceded with "n". The bool value false is encoded as `0_`, true
// value as `1_`. The char constants are encoded using their Unicode scalar
// value.
<const-data> = ["n"] {<hex-digit>} "_"

// <base-62-number> uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and
// 'Z' is decimal 61.
// "_" with no digits indicates the value 0, while any other value is offset
// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc.
<base-62-number> = {<0-9a-zA-Z>} "_"

<backref> = "B" <base-62-number>

// We use <path> here, so that we don't have to add a special rule for
// compression. In practice, only a crate root is expected.
<instantiating-crate> = <path>

// There are no restrictions on the characters that may be used
// in the suffix following the `.` or `$`.
<vendor-specific-suffix> = ("." | "$") <suffix>

Namespace Tags

Namespaces are identified by an implementation defined single character tag (the <namespace> production). Only closures (C) and shims (S) have a specific character assigned to them so that demanglers can reliable adjust their output accordingly. Other namespace tags have to be omitted or shown verbatim during demangling.

This is a concession to the compiler's current implementation. While the language only knows two namespaces (the type and the value namespace), the compiler uses many more in some important data structures and disambiguation indices are assigned according to these internal data structures. So, in order not to force the compiler to waste processing time on re-constructing different disambiguation indices, the internal unspecified "namespaces" are used. This may change in the future.

Type-Level Constants

As described above, the grammar encodes constant values via the <const-data> = {<hex-digit>} "_" production, where {<hex-digit>} is the numeric value of the constant, not its representation as bytes. Using the numeric value is platform independent but does not easily scale to non-integer data types.

In the future it is likely that Rust will support complex type-level constants (i.e. not just integers). This RFC suggests to develop a proper mangling for these as part of the future const-generics work, and, for now, only define a mangling for integer values.

Punycode Identifiers

Punycode generates strings of the form ([[:ascii:]]+-)?[[:alnum:]]+. This is problematic because of the - character, which is not in the supported character set; Punycode uses it to separate the ASCII part (if it exists), from the base-36 encoding of the non-ASCII characters.

For this reasons, we deviate from vanilla Punycode, by replacing the - character with a _ character.

Here are some examples:

Original Punycode Punycode + Encoding
føø f-5gaa f_5gaa
α_ω _-ylb7e __ylb7e
铁锈 n84amf n84amf
🤦 fq9h fq9h
ρυστ 2xaedc 2xaedc

With this post-processing in place the Punycode strings can be treated like regular identifiers and need no further special handling.

Vendor-specific suffix

Similarly to the Itanium C++ ABI mangling scheme, a symbol name containing a period (.) or a dollar sign ($) represents a vendor-specific version of the symbol. There are no restrictions on the characters following the period or dollar sign.

This can happen in practice when locally unique names needed to become globally unique. For example, LLVM can append a .llvm.<numbers> suffix during LTO to ensure a unique name, and $ can be used for thread-local data on Mach-O. In these situations it's generally fine to ignore the suffix: the suffixed name has the same semantics as the original.

Compression

Symbol name compression works by substituting parts of the mangled name that have already been seen for a back reference. Compression is directly built into the mangling algorithm, as shown by the following piece of pseudocode:

fn mangle(node, output_string, substitution_dictionary) {
    if let Some(backref) = substitution_dictionary.get(node) {
        // Emit the backref instead of the node's contents
        mangle(backref, output_string)
    } else {
        // Remember where the current node starts in the output
        let start_position = output_string.len()

        // Do the actual mangling, including recursive mangling of child nodes

        // Add the current node to the substitution dictionary
        if node.is_substitutable() {
            substitution_dictionary.insert(node, start_position)
        }
    }
}

This algorithm automatically chooses the best compression because parent nodes (which are always larger) are visited before child nodes.

Note that this kind of compression relies on the fact that all substitutable AST nodes have a self-terminating mangled form, that is, given the start position of the encoded node, the grammar guarantees that it is always unambiguous where the node ends. This is ensured by not allowing optional or repeating elements at the end of substitutable productions.

Decompression

Decompression too is built directly into demangling/parsing. When a back reference is encountered, we decode the referenced position and use a temporary demangler/parser to do the decoding of the node's actual content:

fn demangle_at(&mut pos, mangled, output_string) {
  if is_backref(*pos, mangled) {
    // Read the byte offset of the referenced node and
    // advance `pos` past the backref.
    let mut referenced_pos = decode(pos, mangled);
    demangle_at(&mut referenced_pos, mangled, output_string)
  } else {
    // do regular demangling
  }
}

Using byte offsets as backref keys (as this RFC does) instead of post-order traversal indices (as Itanium mangling does) has the advantage that the demangler does not need to duplicate the mangler's substitution indexing logic, something that can become quite complex (as demonstrated by the compression scheme proposed in the initial version of this RFC).

A Note On Implementing Efficient Demanglers

The mangling syntax is constructed in a way that allows for implementing efficient demanglers:

  • Mangled names contain information in the same order as unmangled names are expected to contain it. Therefore, a demangler can directly generate its output while parsing the mangled form. There is no need to explicitly instantiate the AST in memory.

  • The same is true for decompression. Decompression can be done without allocating memory outside of the stack. Alternatively the demangler can keep a simple array that maps back-ref indices to ranges in the already generated output. When it encounters a <backref> in need of expansion, it can just look up corresponding range and do a simple memcpy.

Parsing, decompression, and demangling can thus be done in a single pass over the mangled name without the need for complex data structures, which is useful when having to implement #[no_std] or C demanglers. (Note that Punycode can complicate decoding slightly because it needs dynamic memory allocation in the general case but it can be implemented with an on-stack buffer for a reasonable maximum supported length).

Mapping Rust Language Entities to Symbol Names

This RFC suggests the following mapping of Rust entities to mangled names:

  • Named functions, methods, and statics shall be represented by a <path> production.

  • Paths should be rooted at the inner-most entity that can act as a path root. Roots can be crate-ids, inherent impls, trait impls, and (for items within default methods) trait definitions.

  • The compiler is free to choose disambiguation indices and namespace tags from the reserved ranges as long as it ascertains identifier unambiguity.

  • Generic arguments that are equal to the default should not be encoded in order to save space.

Drawbacks

Why should we not do this?

  • The current/legacy scheme based on symbol-hashes is flexible in that hashes can be changed at will. That is, the unstable part of the current mangling scheme is nicely contained and does not keep breaking external tools. The danger of breakage is greater with the scheme proposed here because it exposes more information.

Rationale and alternatives

The alternatives considered are:

  1. Keeping the current scheme. It does meet the minimum requirements after all. However, the general consensus seems to be that this leads to situations where people are unpleasantly surprised when they come across (demangled) symbol names in backtraces or profilers.

  2. Keeping the current scheme but cleaning it up by making the non-hash part more consistent and more expressive. Keep the hash part as a safeguard against symbol conflicts and the rest as something just for demangling. The downside of this is that the hash would still not be predictable, and symbols would get rather long if they should contain more human-readable information about generic arguments.

  3. Define a standardized pretty-printing format for things that end up as symbols, and then encode that via Punycode in order to meet the character set restrictions. This would be rather simple. Symbol names would remain somewhat human-readable (but not very, because all separators would be stripped out). But without some kind of additional compression, symbol names would become rather long.

  4. Use the scheme from the previous bullet point but apply the compression scheme described above. We could do this but it wouldn't really be less complex than the scheme proposed by the RFC.

  5. Define a standardized pretty-printing format for things that end up as symbols, compress with zstd (specially trained for Rust symbols) and encode the result as base63. This is rather simple but loses all human-readability. It's unclear how well this would compress. It would pull the zstd specification into the mangling scheme specification, as well as the pre-trained dictionary.

Prior art

One of the major modern mangling schemes with a public specification is the Itanium C++ ABI scheme for C++ which is used by the GCC toolchain. An initial version of this RFC sticked closely to Itanium mangling, however, the latest version only retains the run-length encoding for identifiers and some literals for tagging things like basic types. The Itanium scheme has been criticized for being overly complex, due to its extensive grammar and two separate compression schemes.

The idea of using Punycode for handling of unicode identifiers is taken from the Swift programming language's mangling scheme.

Unresolved questions

Punycode vs UTF-8

During the pre-RFC phase, it has been suggested that Unicode identifiers should be encoded as UTF-8 instead of Punycode on platforms that allow it. GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard time making up their mind about this issue. Here are some interesting points that might influence the final decision:

  • Using UTF-8 instead of Punycode would make mangled strings containing non-ASCII identifiers a bit more human-readable. For demangled strings, there would be no difference.

  • Punycode support is non-optional since some platforms only allow a very limited character set for symbol names. Thus, we would be using UTF-8 on some platforms and Punycode on others, making it harder to predict what a symbol name for a given item looks like.

  • Punycode encoding and decoding is more runtime effort for the mangler and demangler.

  • Once a demangler supports Punycode, it is not much effort to support both encodings. The u identifier prefix tells the demangler whether it is Punycode. Otherwise it can just assume UTF-8 which already subsumes ASCII.

UPDATE: This RFC recommends that Punycode encoded identifiers must be supported by demanglers but that it is up to the compiler implementation (for now) to decide whether to use it for a given platform. This question will have to be revisited if Rust ever wants to define a stable ABI.

Encoding parameter types for function symbols

It has been suggested that parameter types for functions and methods should be encoded in mangled form too. This is not necessary for symbol name uniqueness but it would provide an additional safeguard against silent ABI-related errors where definition and callers of some function make different assumptions about what parameters a function takes. The RFC does not propose to do this because:

  • Rust makes sure this cannot happen via crate metadata,
  • it would make symbol names longer, and
  • only some but not all ABI related errors are caught by the safeguard.

However, a final decision on the topic has not been made yet.

UPDATE: This RFC suggests that parameter types are not encoded into function and method symbols. Symbol names will already get significantly longer due to encoding additional information and the additional safeguard provided against ABI mismatches is less relevant for Rust than it is for other languages that don't have a concept of library/crate metadata.

Appendix A - Suggested Demangling

This RFC suggests that names are demangled to a form that matches Rust syntax as it is used in source code, compiler error messages and rustdoc:

  • Path components should be separated by ::.

  • If the path root is a <crate-id> it should be printed as the crate name. If the context requires it for correctness, the crate disambiguator can be printed too, as in, for example, std[a0b1c2d3]::collections::HashMap. In this case a0b1c2d3 would be the disambiguator. Usually, the disambiguator can be omitted for better readability.

  • If the path root is an impl, it should be printed as <SelfType> (for inherent impls) or <SelfType as Trait> (for trait impls), like the compiler does in error messages. The <impl-path> also contained in the AST node should usually be omitted.

  • The list of generic arguments should be demangled as <T1, T2, T3>.

  • Identifiers can have a numeric disambiguator (the <disambiguator> production). The syntactic version of the numeric disambiguator maps to a numeric index. If the disambiguator is not present, this index is 0. If it is of the form s_ then the index is 1. If it is of the form s<base-62-digit>_ then the index is <base-62-digit> + 2. The suggested demangling of a disambiguator is [<index>]. However, for better readability, these disambiguators should usually be omitted in the demangling altogether. Disambiguators with index zero can always be omitted.

    The exception here are closures. Since these do not have a name, the disambiguator is the only thing identifying them. The suggested demangling for closures is thus {closure}[<index>].

Appendix B - Examples

We assume that all examples are defined in a crate named mycrate[1234].

Free-standing Item

mod foo {
  mod bar {
    fn baz() {}
  }
}
  • unmangled: mycrate::foo::bar::baz
  • mangled: _RNvNtNtCs1234_7mycrate3foo3bar3baz

Item Defined In Inherent Method

struct Foo<T>(T);

impl<T> Foo<T> {
  pub fn bar<U>(_: U) {
    static QUUX: u32 = 0;
    // ...
  }
}
  • unmangled: <mycrate::Foo<_>>::bar::QUUX
  • mangled: _RNvNvMCs1234_7mycrateINtCs1234_7mycrate3FoopE3bar4QUUX

Item Defined In Trait Method

struct Foo<T>(T);

impl<T> Clone for Foo<T> {
  fn clone(&self) -> Self {
    static QUUX: u32 = 0;
    // ...
  }
}
  • unmangled: <mycrate::Foo<_> as std::clone::Clone>::clone::QUUX
  • mangled: _RNvNvXCs1234_7mycrateINtCs1234_7mycrate3FoopENtNtC3std5clone5Clone5clone4QUUX

Item Defined In Initializer Of A Static

pub static QUUX: u32 = {
  static FOO: u32 = 1;
  FOO + FOO
};
  • unmangled: mycrate::QUUX::FOO
  • mangled: _RNvNvCs1234_7mycrate4QUUX3FOO

Compressed Prefix Constructed From Prefix That Contains A Substitution Itself - TODO

  • unmangled: mycrate::foo<mycrate::bar,mycrate::bar::baz>
  • mangled: _RINvCs1234_7mycrate3fooNvB4_3barNvBn_3bazE

Progressive type compression

  • unmangled: std::foo<(std::Bar,std::Bar),(std::Bar,std::Bar)>
  • mangled: _RINxC3std3fooTNyB4_3BarBe_EBd_E

Appendix C - Change LOG

  • Removed mention of Itanium mangling in introduction.
  • Weakened "predictability" goal.
  • Removed non-goal of not providing a mangling for lifetimes.
  • Added non-goal for not trying to standardize the demangled form.
  • Updated specification and examples to new grammar as proposed by eddyb.
  • impl disambiguation strategy changed to using the impl path instead of param bounds.
  • Updated prior art section to not say this RFC is an adaptation of Itanium mangling.
  • Updated compiler's expected assignment of disambiguation indices and namespace tags.
  • Removed "complexity" drawback since the scheme is not very complex anymore.
  • Removed unresolved question "Re-use <disambiguator> for crate disambiguator".
  • Added note about default generic arguments to reference-level-explanation.
  • Added note about Punycode making decoding more complicated.
  • Resolve question of complex constant data.
  • Add a recommended resolution for open question around Punycode identifiers.
  • Add a recommended resolution for open question around encoding function parameter types.
  • Allow identifiers to start with a digit.
  • Make <binder> optional in <fn-sig> and <dyn-bounds> productions.
  • Extend <const-data> to include bool values, char values, and negative integer values.
  • Remove type from constant placeholders.
  • Allow vendor-specific suffixes.