Generic Integers V2: It's Time #3686
…p size/alignment to a multiple of 64 bits.
Fix some nits
…eric integers since that's not an issue any more
This reverts commit 25f85cc105cb04b4e87debf46f4547240c122ae4.
As much as I dislike 👍 from me |
Even if we should probably leave them out of the initial RFC for complexity reasons, I would just cheat with floats, as they rely on system libraries and hardware instructions way more than regular integers. By that, I mean that I'd allow |
Are you proposing delaying the discussion or the implementation? My understanding is that with a release early 2025, Rust 2024 will be done by mid November, which is only 2 months away, and it seems quite unlikely this RFC would be accepted and implementation ready to start by then, so I see no conflict with regard to starting on the implementation... ... but I could understand a focus on the edition for the next 2 months, and thus less bandwidth available for discussing RFCs. |
The problem with this approach is that any "cheating" becomes permanently stabilised, and thus, it's worth putting in some thought for the design. This isn't to say that Plus, monomorphisation-time errors were actually one of the big downsides to the original RFC, and I suspect that people haven't really changed their thoughts since then. Effectively, while it's okay to allow some of edge-case monomorphisation-time errors like this RFC includes (for example, asking for One potential solution that was proposed for unifying
And it would support all float types, forever, and there would be no invalid values for
As stated: yes, RFCs take time to discuss and implement and it's very reasonable to expect people to focus on the 2024 edition for now. However, that doesn't mean that we can't discuss this now, especially since there are bound to be things that were missed that would be good to point out. |
In general, operations on `u<N>` and `i<N>` should work the same as they do for existing integer types, although the compiler may need to special-case `N = 0` and `N = 1` if they're not supported by the backend.

When stored, `u<N>` should always zero-extend to the size of the type and `i<N>` should always sign-extend. This means that any padding bits for `u<N>` can be expected to be zero, but padding bits for `i<N>` may be either all-zero or all-one depending on the sign.
Please clarify this to say what exactly happens when I transmute e.g. `255u8` to `u<7>` (and similar to `i<N>`). I assume it is UB, i.e., the validity invariant of these types says that the remaining bits are zero-extended / sign-extended, but the RFC should make that explicit.
Note that calling this "padding" might be confusing since "padding" in structs is uninitialized, but here padding would be defined to always have very specific values. (That would, e.g., allow it to be used as a niche for enum optimizations.)
Yeah, I'm not quite sure what a better name is; it's the same as `rustc_layout_scalar_valid_range`, which is UB if the bits are invalid.
I guess that since this is the reference description, calling them niche bits would be more appropriate? Would that feel reasonable?
No. Niche bits are an implementation detail of the enum layout algorithm, and mostly not stable nor documented.
Just describe what the valid representations of values of these types are, i.e., what should go into this section about these types.
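To make the validity question above concrete, here is a minimal sketch that emulates the proposed storage rule for a 7-bit integer held in one byte. The helper names (`is_valid_u7`, `is_valid_i7`, `u7_from_byte`) are hypothetical and not part of the RFC; the sketch only assumes what the quoted RFC text states, namely that the unused high bit is zero for `u<7>` and a copy of the sign bit for `i<7>`.

```rust
/// Hypothetical check: is `byte` a valid stored form of a 7-bit unsigned
/// integer, assuming the unused high bit must be zero (zero-extension)?
pub fn is_valid_u7(byte: u8) -> bool {
    byte & 0x80 == 0
}

/// Hypothetical check: is `byte` a valid stored form of a 7-bit signed
/// integer, assuming the unused high bit must copy the sign bit (bit 6)?
pub fn is_valid_i7(byte: u8) -> bool {
    let sign_bit = (byte >> 6) & 1;
    let extension_bit = (byte >> 7) & 1;
    sign_bit == extension_bit
}

/// Checked stand-in for the transmute discussed in the thread: it returns
/// `None` exactly for the bit patterns a raw transmute would have to treat
/// as invalid under the assumption above.
pub fn u7_from_byte(byte: u8) -> Option<u8> {
    is_valid_u7(byte).then_some(byte)
}
```

Under these assumptions, `255u8` has its high bit set and is therefore not a valid `u<7>` representation, which is the case the reviewer asks the RFC to spell out.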
The compiler should be allowed to restrict `N` even further, maybe even as low as `u16::MAX`, due to other restrictions that may apply. For example, the LLVM backend currently only allows integers with widths up to `u<23>::MAX` (not a typo; 23, not 32). On 16-bit targets, using `usize` further restricts these integers to `u16::MAX` bits.

While `N` could be a `u32` instead of `usize`, keeping it at `usize` makes things slightly more natural when converting bits to array lengths and other length-generics, and these quite high cutoff points are seen as acceptable. In particular, this helps using `N` for an array index until [`generic_const_exprs`] is stabilized.
You mean "using N for an array length", I assume?
Yes.
As an example, someone might end up using `u<7>` for a percent since it allows fewer extraneous values (`101..=127`) than `u<8>` (`101..=255`), although this actually just overcomplicates the code for little benefit, and may even make the performance worse.

Overall, things have changed dramatically since [the last time this RFC was submitted][#2581]. Back then, const generics weren't even implemented in the compiler yet, but now, they're used throughout the Rust ecosystem. Additionally, it's clear that LLVM definitely supports generic integers to a reasonable extent, and languages like [Zig] and even [C][`_BitInt`] have implemented them. A lot of people think it's time to start considering them for real.
I wouldn't say Zig has generic integers, it seems like they have arbitrarily-sized integers. Or is it possible to write code that is generic over the integer size?
well actually you can

    const std = @import("std");

    fn U(comptime bits: u16) type {
        return @Type(std.builtin.Type {
            .Int = std.builtin.Type.Int {
                .signedness = std.builtin.Signedness.unsigned,
                .bits = bits,
            },
        });
    }

    pub fn main() !void {
        const a: U(2) = 1;
        const b: U(2) = 3;
        // const c: U(2) = 5; // error: type 'u2' cannot represent integer value '5'
        const d = std.math.maxInt(U(147));
        std.debug.print("a={}, b={}, d={}", .{ a, b, d });
        // a=1, b=3, d=178405961588244985132285746181186892047843327
    }
I guess that example is satisfactory enough, @RalfJung? Not really sure if it's worth the effort to clarify explicitly.
Ah, neat.
C and LLVM only have concrete-width integers though, I think?
I mean, C doesn't have generic anything, so, I guess you're right. Not 100% sure the distinction is worth it.
I love this! One point that is touched upon here is aliases for I think that'd be super valuable to have. Rust already has a lot of symbols and being able to not use the angle brackets makes sure that the code is much calmer to look upon. It's also not the first explicit syntax sugar since an Having the aliases also allows for this while keeping everything consistent:

    fn foo<const N: usize>(my_num: u<N>) { ... }
    foo(123); // What is the bit width? u32 by default?
    foo(123u7); // Fixed it
I agree with you, just didn't want to require them for the initial RFC, since I wanted to keep it simple. Ideally, the language will support |
When the last RFC was postponed, the stated reason was waiting for pure library solutions to emerge and letting the experience with those inform the design. I don't really see much of this in the current RFC, so here's a bunch of questions about it. It would also be great if some non-obvious design aspects of the RFC (such as limits on `N`, whether and how post-monomorphization errors work, padding, alignment, etc.) could be justified with experience from such libraries.
This was the main proposal last time this RFC rolled around, and as we've seen, it hasn't really worked.

Crates like [`u`], [`bounded-integer`], and [`intx`] exist, but they come with their own host of problems:
As far as I can tell, `bounded-integer` and `intx` only provide subsets of the native types up to {i,u}128, not arbitrarily large fixed-size integers. The `u` crate seems to be about something else entirely, did you mean to link something different there?
So where are the libraries that even try to do what this RFC proposes: arbitrary number of bits, driven by const generics? I've searched and found ruint, which appears relevant.
That definitely seems like a good option to add to the list. I had trouble finding them, so, I appreciate it.
I'd appreciate a mention of https://crates.io/crates/arbitrary-int, which is (I think) the closest in design to this rfc
* None of these libraries can easily unify with the existing `uN` and `iN` types.
A const-generic library type can't provide this and also can't support literals. But what problems exactly does that cause in practice? Which aspects can be handled well with existing language features and which ones really need language support?
The RFC already mentions how being able to provide a small number of generic impls that cover all integer types has an extremely large benefit over being forced to use macros to implement for all of them individually. You cannot do this without language support.
So this bullet point is "only" about impls like `impl<const BITS: usize> Foo for some_library::Int<BITS> { ... }`, not implementing anything for the primitive integer types? Could `From` impls and some form of delegation (#3530) also help with this?
Not really, and this is mentioned in the RFC also. That's 5 impls for unsigned, 5 impls for signed that could just be 2 impls, whether you have delegation or not. Even for simple traits, like `Default`, you're incentivised to use a macro just because it becomes so cumbersome.
`arbitrary-int` provides a unification somewhat using its `Number` trait. It's somewhat rudimentary but I am working on improving it.
Reading this again, the Number trait fulfills a somewhat different role though. It allows writing generic code against any Number (be it an arbitrary-int or a native int), but it does not expose the bits itself - which can be a plus or a minus, depending on what you're building.
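As a rough illustration of the "5 impls for unsigned, 5 impls for signed" point made earlier in this thread, the sketch below shows the macro pattern that crate authors typically reach for today. The trait `BitWidth`, the macro `impl_bit_width`, and the function `bits_of` are hypothetical names invented for this example; none of them come from the RFC or from any crate mentioned here.

```rust
/// Hypothetical trait, used only to illustrate the boilerplate.
pub trait BitWidth {
    /// Number of bits in the implementing integer type.
    const WIDTH: u32;
}

/// Today, covering every primitive integer needs one impl per type, so the
/// impls are usually stamped out with a macro like this one.
macro_rules! impl_bit_width {
    ($($t:ty),* $(,)?) => {
        $(
            impl BitWidth for $t {
                const WIDTH: u32 = (core::mem::size_of::<$t>() * 8) as u32;
            }
        )*
    };
}

// Five unsigned impls and five signed impls.
impl_bit_width!(u8, u16, u32, u64, u128);
impl_bit_width!(i8, i16, i32, i64, i128);

// With the proposed types, the ten impls above could in principle be
// replaced by two generic ones. This is not valid Rust today:
//
//     impl<const N: usize> BitWidth for u<N> { const WIDTH: u32 = N as u32; }
//     impl<const N: usize> BitWidth for i<N> { const WIDTH: u32 = N as u32; }

/// Generic code written against the hypothetical trait.
pub fn bits_of<T: BitWidth>() -> u32 {
    T::WIDTH
}
```

The macro keeps the repetition out of sight, but the number of impls still grows with the number of integer types, which is the cost the comment above is describing.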
* Generally, they require a lot of unsafe code to work.
What kind of unsafe code, and for what purposes? And is that sufficient reason to extend the language? Usually, if it's something that can be hidden behind a safe abstraction once and for all, then it seems secondary whether that unsafety lives on crates.io, in sysroot crates, or in the functional correctness of the compiler backend.
Generally, the unsafe code is stuff similar to the `bounded-integer` crate, where integers are represented using enums and transmuted from primitives. The casting to primitives is safe, but not the transmuting.
Is that really all? Because that seems trivial to encapsulate without affecting the API, and likely to be solved by any future feature that makes it easier to opt into niche optimizations (e.g., pattern types).
Yeah, it's easy to encapsulate, but I think it's worth mentioning that unsafe code is involved as a negative because it means many code bases will be more apprehensive to use it.
You are right that it could easily be improved, though, with more compiler features. I just can't imagine it ever being on par with the performance of a compiler-supported version, both at runtime and compile time.
arbitrary-int works without unsafe code (with the exception of the optional function `new_unchecked` which skips the bounds check)
* These representations tend to be slower and less-optimized than compiler-generated versions.
Do we have any data on what's slower and why? Are there any lower-stakes ways to fix these performance issues by, for example, adding/stabilizing suitable helper functions (like rust-lang/rust#85532) or adding more peephole optimizations in MIR and/or LLVM?
Main source of slowdown is from using enums to take advantage of niche optimisations; having an enum with a large number of variants to represent this niche is pretty slow to compile, even though most of the resulting code ends up as no-ops after optimisations.
I definitely should mention that I meant slow to compile here, not slow to run. Any library solution can be made fast to run, but will generally suffer in compile time when these features are effectively already supported by the compiler backends, mostly for free.
Is there any compile time issue when not trying to provide niches? Out of the potential use cases the RFC lists, only a couple seem to really care about niche optimizations. In particular, I don't expect that it typically matters for integers larger than 128 bits. (But again, surveying the real ecosystem would help!) If so, the compile time problem for crates like bounded-integer could be addressed more directly by stabilizing a proper way to directly opt into niches instead of having to abuse enums. And that would help with any bounds, while this RFC (without future possibilities) would not.
Well, I would expect some negative compile-time impact from repeatedly monomorphizing code that's const-generic over bit width or bounds. But that's sort of inherent in having lots of code that is generic in this way, so it's no worse for third party libraries than for something built-in.
That's very fair; I agree that we should have an ability to opt into niches regardless. I guess that my reasoning here is pretty lackluster because I felt that the other reasons to have this feature were strong enough that this argument wasn't worth arguing, although you're right that I should actually put a proper argument for it.
From what I've seen, of the use cases for generic integers:
1. Generalising primitives
2. Between-primitives integer types (like `u<7>` and `u<48>`)
3. Larger-than-primitives integer types

For 1, basically no library solution can work, so, that's off the table. For 2, which is mostly the subject of discussion here, you're right that it could probably be improved a lot with existing support. And for 3, most people just don't find the need to make generalised code for their use cases, and just explicitly implement, say, `u256` themselves with the few operations they need.
The main argument IMHO is that we can effectively knock out all three of these options easily with generic integers supported by the language, and they would be efficient and optimized by the compiler. We can definitely whittle down the issues with 2 and 3 as we add more support, but the point is that we don't need to if we add in generic integers.
Although, I really need to solidify this argument, because folks like you aren't 100% convinced, and I think that the feedback has been pretty valuable.
Yeah, I appreciate that you're trying to tackle a lot of different problems with a unifying mechanism. I focus on each problem separately because I want to tease out how much value the unifying mechanism adds for each of them, compared to smaller, more incremental additions that may be useful and/or necessary in any case. Only when that's done I feel like I can form an opinion on whether this relatively large feature seems worth it overall.
* They still require you to generalise integer types with macros instead of const generics.
I'm not sure I understand the problem here. If a library provides `struct Int<const BITS: usize>(...);` then code using this library shouldn't need macros to interact with it (except, perhaps, as workaround for current gaps in const generics). The library itself would have a bunch of impls relating its types to the language primitives, which may be generated with macros. But that doesn't seem like such a drastic problem, if it's constrained to the innards of one library, or a few competing libraries.
I'm not sure I understand your argument. No matter what, a library solution cannot be both generic and unify with the standard library types. I don't see a path forward that would allow, for example, some library `Uint<N>` type to allow `Uint<8>` being an alias for `u8` while also supporting arbitrary `Uint<N>`. Even with specialisation, I can't imagine a sound subset of specialisation allowing this to work.
Like, sure, a set of libraries can choose to only use these types instead of the primitives, circumventing the problem. But most people will want to implement their traits for primitives for interoperability.
This overlaps a bit with the bullet point about unification, but I do think it depends a lot on what one is doing. For example, the num-traits crate defines traits that it needs to implement for the primitive types. On the other hand, any code that's currently written against the traits from num-traits may be happy with a third party library that provides `Int<N>` and `Uint<N>` and implements the relevant traits for them. And for something like bit fields, you may not need much generalization over primitive types at all: in the `MipsInstruction` example, you probably want some widening and narrowing conversions, but only with respect to u32 specifically.
It's hard to form an opinion about how common these scenarios are (and whether there are other nuances) without having a corpus of "real" code to look at. Experience reports (including negative ones) with crates like num-traits and bounded-integer may be more useful than discussing it in the abstract.
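For readers who have not used such a crate, the following is a minimal sketch of the kind of library type this thread is debating. It is a hypothetical `Uint<BITS>` backed by a `u64` and is not the API of `arbitrary-int`, `ruint`, or any other crate named above. It assumes `BITS` is at most 64.

```rust
/// Hypothetical library type: an unsigned integer of `BITS` bits (assumed
/// to be at most 64), stored in a `u64` whose unused high bits are zero.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Uint<const BITS: usize>(u64);

impl<const BITS: usize> Uint<BITS> {
    /// Largest representable value, `2^BITS - 1`.
    pub const MAX: u64 = if BITS >= 64 { u64::MAX } else { (1u64 << BITS) - 1 };

    /// Checked constructor: rejects values that do not fit in `BITS` bits.
    pub fn new(value: u64) -> Option<Self> {
        (value <= Self::MAX).then_some(Self(value))
    }

    /// Addition that wraps around at `2^BITS`.
    pub fn wrapping_add(self, rhs: Self) -> Self {
        Self(self.0.wrapping_add(rhs.0) & Self::MAX)
    }

    /// Returns the value widened to `u64`.
    pub fn get(self) -> u64 {
        self.0
    }
}

// `Uint<8>` and `u8` remain distinct types, so the library has to bridge
// them with explicit conversions, written once per primitive width.
impl From<u8> for Uint<8> {
    fn from(value: u8) -> Self {
        Self(value as u64)
    }
}

impl From<Uint<8>> for u8 {
    fn from(value: Uint<8>) -> Self {
        value.0 as u8
    }
}
```

Code that only uses `Uint<BITS>` needs no macros, which matches the reviewer's point. The `From` impls at the bottom are where the two sides disagree: they let values cross over, but `Uint<8>` never becomes the same type as `u8`.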
Two things that came to mind:
So, I agree that this was one of the reasons, but it's worth reiterating that also, at that time, const generics weren't even stable. We had no idea what the larger ecosystem would choose to do with them, considering how many people were waiting for stabilisation to really start using them. (We had an idea of what was possible, but not what would feel most ergonomic for APIs, etc.) So, I personally felt that the library solution idea was mostly due to the fact that we didn't really know what libraries would do with const generics. And, overwhelmingly, there hasn't been much interest in it for what I believe to be the most compelling use case: generalising APIs without using macros, which right now cannot really be done without language support.
By itself, the observation that the ecosystem isn't rushing to write and use such libraries doesn't tell us much. What if there just isn't much demand for generalizing code like that? On the other hand, if people have tried to write this code, and have run into concrete problems that they can't solve with more library code, then that is useful information for extending the language. Our discussion above has already focused on this. In addition, if there's any practical experience with API design, regardless of QoI problems like runtime performance / compile time / compiler diagnostics, then it would be useful to incorporate that into the RFC. |
I mean, ultimately, the compiler can pretty easily deal with this kind of self-referencing by nature of what it is. Whenever the compiler even sees It would be an issue to have, for example,
I mean, even C is adding these now as features, so, I think that regardless of whether the compilers are robust enough, they will be robust enough by the time an implementation in Rust comes to fruition. |
The primary reason for leaving this out is… well, it's a lot different from the existing integer types in the language. Additionally, such a proposal can coexist with this one, since there's nothing stopping us from making an `int<A..=B>` type in the future such that `u<N>` and `i<N>` are equivalent to the correct range of integers.

## Integer traits
I don't think it would be a good alternative for the standard library to provide such traits. However, they do exist in the crates.io ecosystem and are reasonably popular (mainly num-traits). Insofar as the RFC is motivated by letting people generalize their code over the existing primitive types, it would benefit from engaging with those traits. There's definitely things a language feature can do better, e.g., `impl<N> Foo for u<N>` is less likely to cause coherence problems than `impl<T: Int> Foo for T`. On the other hand, the existing traits do solve some problems, and not all drawbacks listed in this section apply to them. Libraries can iterate on "which methods are in which trait" and "what traits even exist" much more easily than std because semver-breaking changes are possible for them. It's also possible to have competing traits with overlapping sets of methods, which is mitigated by std providing inherent impls (though there can still be ambiguities: rust-lang/rust#68826).
That's not true. |
how does this interact with enum layout optimizations? can |
that |
wrong but left for posterity
So, it's slightly more complicated than that; -128i8 and -64i7 have the same bit pattern in memory: all ones. However, you're right that there are values in there that don't, like positive 64 (which is So, really, the valid ranges are, assuming we're treating the number as an
This is where I definitely should have been more careful in my wording: I meant that the backend doesn't distinguish them, but you're right that the type system needs to distinguish them all the way until the code has all been generated. My statement was basically wrong as-is. |
Why do they even have that lever? |
you need to check your math, |
You're right 🤦🏻 |
I want to chime in here and link to another library: https://github.com/danlehmann/arbitrary-int I would say the main painpoints of using a library here are all the conversions between different number sizes (and in the case of bitfields, structs) and the ergonomics (e.g. hecatia-elegua/bilge#75). Another nice byproduct of this RFC would be the constification of conversions between arbitrary ints by design. For some time I used the unstable const-trait-impl with const-from for converting between things, but that's being reworked now and not currently available. |
This will cause the compiler to fail when `MyDefault` is used for `u<0>` or `u<1>`, since it will force the constant block to be evaluated. Not ideal, but it's the best we've got for now.
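The RFC's own `MyDefault` example is not shown in this excerpt, so the following is only a sketch of the general mechanism the paragraph above describes, using today's stable Rust. `Bits<N>` is a hypothetical stand-in for `u<N>`, and the body of `my_default` is invented for illustration. The sketch relies on inline `const` blocks, which are evaluated once per monomorphization.

```rust
/// Hypothetical stand-in for `u<N>`; the real type does not exist yet.
pub struct Bits<const N: usize>;

/// Hypothetical trait named after the one the RFC text refers to.
pub trait MyDefault {
    fn my_default() -> u64;
}

impl<const N: usize> MyDefault for Bits<N> {
    fn my_default() -> u64 {
        // This block is evaluated at compile time for each concrete `N` the
        // function is instantiated with. If the assertion fails, compilation
        // fails at that point (a post-monomorphization error).
        const { assert!(N >= 2, "this default needs at least two bits") };
        2
    }
}
```

Calling `<Bits<8> as MyDefault>::my_default()` compiles and returns `2`, whereas the same call on `Bits<1>` would be rejected at compile time only once that instantiation is actually used, not when the generic impl is written.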
## Uncommonly sized integers
One thing that this immediately makes me think about is `i<0>` -- is that allowed? What value, if any, does it have?
(`u<0>` is easy, but `i<0>` doesn't really make any sense given that `i32` includes the sign bit in those `32`, and there can't be any sign bit in zero bits.)
Here are two answers. Their order was chosen by an unbiased coin flip.
Obviously, the value should be negative one, because two's complement always has one more negative number than positive numbers.
Obviously, the value should be zero, because `i<N>` and `u<N>` always have this value in common.
Less tongue-in-cheek discussion about why/whether i0 should be permitted (in certain contexts) and what its value should be: https://discourse.llvm.org/t/rfc-support-zero-width-integers-in-the-comb-dialect-circt/78492
Or `i<0>` could be an uninhabitable type.
If it was uninhabited, every possible way of constructing an `i<N>` needs to special case N = 0, either with bounds that rule it out (not properly supported yet, and likely implies an annotation burden) or by panicking. That sounds even more inconvenient than a post-mono error.
Not sure if you just missed this part, but I explicitly clarified the value of `i<0>` with this:
> `u<N>` are able to store integers in the range `0..2.pow(N)` and `i<N>` are able to store integers in the range `-2.pow(N-1)..2.pow(N-1)`. The cheeky specificity of "integers in the range" ensures that, for `i<0>`, the range `-0.5..0.5` only contains the integer zero; in general, `u<0>` and `i<0>` will need to be special-cased anyway, as they must be ZSTs.

This felt like the most logical approach.
Also a side note on uninhabited types: I mention later that it's an unanswered question whether `NonZero<u<0>>` would be allowed and uninhabited, but my assumption would be yes. So, it makes sense that the weird special-casing of zero would be reserved to the types that explicitly call out zero as a special case.
## Basic semantics

The compiler will gain the built-in integer types `u<N>` and `i<N>`, where `const N: usize`. These will be identical to existing `uN` and `iN` types wherever possible, e.g. `u<8> == u8`.
Just for clarity, it'd be good to also include `i32 ≡ i<32>` as an example too, to clarify that it's not `i32 ≡ i<31>`.
I don't mind adding this, but I'm confused how you could come to the conclusion that `i32 == i<31>` here?
maybe if you thought `i<N>` excluded the sign bit?
Why would it, though, if `i32` is including the sign bit?
Because `i<0>` having a sign bit and negative one non-sign bits is awkward, and thus you might assume a different base case.
I mean, I guess that makes sense, but it feels like a lot of extra steps and assumptions compared to how the type is defined. `i<0>` has zero bits, therefore, it also has no sign bit. So, the only value it's capable of representing is one where the sign doesn't matter, which is zero.
Plus, again, `i32` also has 32 bits, and we're including the sign there.
I will update the RFC when I have the time to sit down and respond to all the feedback, but I would like to understand things a bit more so I can help make sure folks aren't confused.
* from `bool` to `u<N>` or `i<N>`
* from `char` to `u<N>` or `i<N>`
* from `u<1>` to `bool`
* from `u<N>` to `char`, where `N < 16`
Is this important? I suppose it's possible, since 0xD800 is the first problem, but feels a bit awkward.
Basically, it's an extension of `u8` being castable to `char`. That is also a mostly arbitrary cutoff point (it might be better to only say that `N <= 7` can be cast to `char`, since that's actually the bounds of ASCII), but we already support `u8`, so I decided to just make it go to the maximum possible cutoff point.
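The `N < 16` cutoff can be checked on stable Rust today. The sketch below is only an illustration, with a hypothetical helper `all_values_are_chars` that is not part of the RFC: it asks whether every value that fits in a given number of bits is a valid `char`, using the existing `char::from_u32`.

```rust
/// Returns true if every value representable in `bits` bits is a valid `char`.
/// Hypothetical helper for illustration; `bits` must be below 32.
fn all_values_are_chars(bits: u32) -> bool {
    (0u32..(1u32 << bits)).all(|v| char::from_u32(v).is_some())
}

fn main() {
    // Every 8-bit and 15-bit value is a Unicode scalar value...
    assert!(all_values_are_chars(8));
    assert!(all_values_are_chars(15));
    // ...but 16 bits reaches the surrogate range starting at 0xD800.
    assert!(!all_values_are_chars(16));
    assert!(char::from_u32(0xD7FF).is_some());
    assert_eq!(char::from_u32(0xD800), None);
}
```

Since `2^15 = 0x8000` is below `0xD800`, 15 is the largest width for which the cast can never produce an invalid `char`.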

`u<N>` are able to store integers in the range `0..2.pow(N)` and `i<N>` are able to store integers in the range `-2.pow(N-1)..2.pow(N-1)`. The cheeky specificity of "integers in the range" ensures that, for `i<0>`, the range `-0.5..0.5` only contains the integer zero; in general, `u<0>` and `i<0>` will need to be special-cased anyway, as they must be ZSTs.

It's always valid to `as`-cast `u<N>` or `i<N>` to `u<M>` or `i<M>`, and the usual sign extension or truncation will occur depending on the bits involved. A few additional casts which are possible:
pondering: maybe only allow `as` for the lossless & value-preserving ones? Then the truncating ones can be via a named method that makes that clearer.
Or maybe offer a function for arbitrary N-to-M that in debug mode will panic but will wrap in release, for generic code to use in places where we don't yet have the type system functionality to constrain `N ≤ M`?
> pondering: maybe only allow `as` for the lossless & value-preserving ones? Then the truncating ones can be via a named method that makes that clearer.

you can already do `0x100u16 as u8`, so why not generalize that to all sizes for consistency? restricting `as` to lossless & value-preserving casts is a breaking change.
Yeah, the main reason is backwards compatibility here: since we can already do truncating conversions, I don't want to special-case it that you can only do that for explicitly `N = 8, 16, 32, 64, 128` and instead just allowed it everywhere. If we have a change in a future edition where only lossless conversions can be as-casted, then, we can change this behaviour, but I'd rather go with consistency for now.
we already have Into for infallible lossless casts.
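For reference, this is how the existing fixed-width types already behave, which is the behaviour the thread proposes to generalize. The wrapper functions are hypothetical names added only to make the casts easy to call; the casts and the `From`/`TryFrom` calls are plain stable Rust.

```rust
/// Truncating cast: keeps only the low 8 bits.
fn truncate_to_u8(x: u16) -> u8 {
    x as u8
}

/// Widening a signed value sign-extends it.
fn sign_extend(x: i8) -> i32 {
    x as i32
}

/// Widening an unsigned value zero-extends it.
fn zero_extend(x: u8) -> i32 {
    x as i32
}

fn main() {
    assert_eq!(truncate_to_u8(0x100), 0); // the example from the comment above
    assert_eq!(truncate_to_u8(0x1FF), 0xFF);
    assert_eq!(sign_extend(-1), -1);
    assert_eq!(zero_extend(0xFF), 255);
    // Lossless conversions are available through `From`/`Into`,
    // and checked narrowing through `TryFrom`.
    assert_eq!(u32::from(7u8), 7);
    assert!(u8::try_from(0x100u16).is_err());
}
```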

The compiler should be allowed to restrict `N` even further, maybe even as low as `u16::MAX`, due to other restrictions that may apply. For example, the LLVM backend currently only allows integers with widths up to `u<23>::MAX` (not a typo; 23, not 32). On 16-bit targets, using `usize` further restricts these integers to `u16::MAX` bits.

While `N` could be a `u32` instead of `usize`, keeping it at `usize` makes things slightly more natural when converting bits to array lengths and other length-generics, and these quite high cutoff points are seen as acceptable. In particular, this helps using `N` for an array index until [`generic_const_exprs`] is stabilized.
Feels like it's still awkward? The usual array for `u32` is not `[_; 32]` but `[_; 4]`, which still needs computation on the generic.
I'd rather have the generic work with things like `.checked_shr(B - 1)`, which would mean `u32`, not `usize`.
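A small sketch of the ergonomic difference being described, using ordinary const generics on stable Rust. The function names `shr_with_usize` and `shr_with_u32` are hypothetical; `checked_shr` is the real method on the integer types and takes a `u32` shift amount.

```rust
/// With a `usize` parameter, the shift amount needs an explicit cast.
/// Assumes `B >= 1`; `B - 1` would underflow for `B == 0`.
fn shr_with_usize<const B: usize>(x: u32) -> Option<u32> {
    x.checked_shr((B - 1) as u32)
}

/// With a `u32` parameter, the expression works without a cast.
/// Assumes `B >= 1` for the same reason.
fn shr_with_u32<const B: u32>(x: u32) -> Option<u32> {
    x.checked_shr(B - 1)
}

fn main() {
    assert_eq!(shr_with_usize::<4>(0b1000), Some(1));
    assert_eq!(shr_with_u32::<4>(0b1000), Some(1));
    // Shifting a 32-bit value by 32 or more is rejected.
    assert_eq!(shr_with_u32::<33>(1), None);
    // The byte array for `u32` has 4 elements, not 32.
    assert_eq!(0u32.to_le_bytes().len(), 4);
}
```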
Oh, it's absolutely still awkward. I guess that the justification is pretty weak here, although another one is the fact that `usize` means you don't have the extra limit at `isize::MAX` bytes for 16-bit targets.
And as always, you can easily cast the const generic in expressions, but it's harder to do so when you want to reuse it as another generic. But I get the argument that you probably won't want to reuse it anyway, and always will need some sort of computation.
I don't think there is any realistic use case for anything close to `u<i16::MAX>`, even on 64-bit targets, so whatever integer type makes typical usage work without casts would be best. While arithmetic operations can be extrapolated to any bit width, once you're past several thousand bits a primitive integer type stops working well. Just use a "proper" bitint library, or write your bit vector ops in terms of `[u64; N]`. It's true that having a post-mono error right above N = 128 is a hassle, but there has to be some limit and it should be well below the point where LLVM grinds to a halt because it legalized every bitwise operation into many thousands of word-sized operations and such large basic blocks trigger pathological compile-time performance all over the place.
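As a rough idea of what "bit vector ops in terms of `[u64; N]`" might look like, here is a minimal sketch on stable Rust. The function names are hypothetical, and only word-wise operations are shown; operations with carries across words (addition, division) would need considerably more work.

```rust
/// Bitwise AND of two fixed-size bit vectors stored as 64-bit words.
fn bit_and<const N: usize>(a: &[u64; N], b: &[u64; N]) -> [u64; N] {
    std::array::from_fn(|i| a[i] & b[i])
}

/// Number of set bits across all words.
fn count_ones<const N: usize>(a: &[u64; N]) -> u32 {
    a.iter().map(|w| w.count_ones()).sum()
}

fn main() {
    let a = [0b1100, u64::MAX];
    let b = [0b1010, 1];
    let c = bit_and(&a, &b);
    assert_eq!(c, [0b1000, 1]);
    assert_eq!(count_ones(&c), 2);
    // A 256-bit vector with every bit set.
    assert_eq!(count_ones(&[u64::MAX; 4]), 256);
}
```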
Yeah, I agree with you. It's actually why I mentioned that `u16::MAX` (or indeed, `i16::MAX`) is probably the maximum we should guarantee that's portable, since even at that point you're much better off with a bigint type. 32k is a ridiculously large amount of bits for one number.
Script:
const v: uX = 0;
std.debug.print("{}", .{v});
| bit-width | compile time | execution size (kb) |
|---|---|---|
| 64 | 5.995797872543335 | 767.0 |
| 1024 | 6.129250764846802 | 841.0 |
| 2048 | 6.318152189254761 | 965.0 |
| 3072 | 6.924046754837036 | 1138.0 |
| 4096 | 7.588716268539429 | 1343.0 |
| 5120 | 10.228423833847046 | 1622.0 |
| 6144 | 14.128804206848145 | 1896.5 |
| 7168 | 20.459102153778076 | 2252.5 |
| 8192 | 29.913185834884644 | 2632.5 |
For reference, in a debug build of Zig without cache on my Windows 11 machine, the compile time increases drastically when the bit width is larger than 4096.
So I think the maximum could be much smaller still, for a better experience under current LLVM.
imo 512 should be more than enough for a MVP
Zig has 65535 as the maximum bit width, clang 18 on x86_64 has 8388608.

## Standard library

The existing macro-based implementation for `uN` and `iN` should be changed to implement for only `u<N>`, `i<N>`, `usize`, and `isize` instead; this has already been implemented in a mostly-generic way and should work as expected.
One thing that might be worth discussing is whether it's worth trying to lint against things with slightly awkward behaviour for different widths. For example, `∀N ≥ 10, u<N>::log2(1000) == 9`, but `u<N>::count_zeros(1000)` gets bigger as `N` gets bigger, and thus code working on unusual or generic `N` might want to use `count_ones` instead, which doesn't have that problem for `u<N>`.
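The same effect is visible with today's fixed-width types. In the sketch below, `ilog2` is the current stable name for the integer base-2 logarithm (the comment writes it as `log2`); the wrapper function is a hypothetical helper used only to group the three results.

```rust
/// Returns `(ilog2, count_ones, count_zeros)` for the value 1000 at three widths.
/// Hypothetical helper for illustration only.
fn stats_for_1000() -> [(u32, u32, u32); 3] {
    [
        (1000u16.ilog2(), 1000u16.count_ones(), 1000u16.count_zeros()),
        (1000u32.ilog2(), 1000u32.count_ones(), 1000u32.count_zeros()),
        (1000u64.ilog2(), 1000u64.count_ones(), 1000u64.count_zeros()),
    ]
}

fn main() {
    // 1000 = 0b11_1110_1000: the highest set bit is bit 9, and six bits are set.
    // `ilog2` and `count_ones` are width-independent; `count_zeros` is not.
    assert_eq!(stats_for_1000(), [(9, 6, 10), (9, 6, 26), (9, 6, 58)]);
}
```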
That's a good point; I'll add a note on using these methods for a potential clippy lint. As I think I mentioned in the RFC, clippy lints will probably be the best starting point for this since it's a good incubation ground while we're not quite sure what lints will be the most useful.
* `From` and `TryFrom` implementations (requires const-generic bounds)
* `from_*e_bytes` and `to_*e_bytes` methods (requires [`generic_const_exprs`])

Currently, the LLVM backend already supports generic integers (you can refer to `iN` and `uN` as much as you want), although other backends may need additional code to work with generic integers.
One thing to emphasize here: getting `u128` to work was a huge endeavour, and bigger ones will be even harder for things like division -- even for 128-bit it calls out to a specific symbol for that.
Embarrassingly-parallel things like `BitAnd` or `count_ones` are really easy to support for bigger widths, but other things might be extremely difficult, so it might be worth exploring what it would look like to allow those only for `N ≤ 128` or something, initially.
> One thing to emphasize here: getting `u128` to work was a huge endeavour, and bigger ones will be even harder for things like division -- even for 128-bit it calls out to a specific symbol for that.

LLVM has a pass specifically for expanding large divisions into a loop that doesn't use a libcall, so that shouldn't really be an issue, though libcalls can still be added if you want something faster: llvm/llvm-project@3e39b27
as part of clang gaining support for `_BitInt(N)` where `N > 128`, basically all the work to make it work has already been done in LLVM. Div/Rem was the last missing piece and that was added in 2022.
Clang still limits `_BitInt(N)` to N <= 128 on quite a few targets: https://gcc.godbolt.org/z/8P3sMjavs
i think that's merely because they haven't got around to defining the ABI, but it all works afaict: https://llvm.godbolt.org/z/88K3ox7bh
Also worth mentioning that having `N > 128` be a post-monomorphisation error was seen as one of the biggest downsides to the previous RFC, and that this would cause more of a headache than just trying to make it work in general. [citation needed]
That's a good start. However, as long as Clang isn't shipping it, the people working on Clang aren't discovering and fixing any bugs specific to those platforms. The div/rem lowering happens in LLVM IR so it's hopefully pretty target-independent, but most other operations are still legalized later in the backends. That includes the operations the div/rem lowering relies on, but also any other LLVM intrinsics that the standard library uses or may want to use in the future.
I'm kind of relying a lot on the fact that even though not everyone is using `_BitInt(N)` right now, by the time we actually would be stabilising this RFC, LLVM would be a lot more robust in that regard. Kind of a role reversal from what happened with 128-bit integers: back then, we were really pushing LLVM to have better support, and C benefited from that, but now, C pushing LLVM to have better support will benefit Rust instead.
As you say, this can be revisited later, but note that there's no guarantee that Clang will ever support `_BitInt(129)` or larger on any particular target. The C standard only requires `BITINT_MAXWIDTH >= ULLONG_WIDTH`. If some target keeps it at 128 for long enough, it could become entrenched enough that nobody wants to risk increasing it (e.g., imagine people putting stuff like `char bits[BITINT_MAXWIDTH / 8];` in some headers).
I actually had no idea that was how the standard worked, but I shouldn't really be surprised, considering how it's C. :/
Hey, I'm the author of https://crates.io/crates/arbitrary-int . It seems like this proposal has some overlap with what I've built as a crate, so I can talk a bit about the hurdles I've run into. Arbitrary-int has a generic type It also provides types to shorten the name. For example It also provides a In general, implementing this as a crate worked pretty well, but there are some downsides:

[#2581]: https://github.com/rust-lang/rfcs/pull/2581
[Zig]: https://ziglang.org/documentation/master/#Primitive-Types

# Rationale and alternatives
To me, the biggest reason to go this way is the coherence possibilities. I'd propose something like
# Rationale and alternatives
## Coherence
One problem with other ways of doing this is that anything trait-based will run afoul of coherence in user code.
For example, if I tried to `impl<T> MyTrait for T where T: UnsignedInteger`, then it takes extra coherence logic -- which doesn't yet exist -- to also allow implementing `MyTrait` for other things. And this is worse if you want blankets for both `T: SignedInteger` and `T: UnsignedInteger` -- which would need like mutually-exclusive traits or similar.
When user code does
```
impl<const N: u32> MyTrait for u<N> { … }
impl<const N: u32> MyTrait for i<N> { … }
```
those are already-distinct types to coherence, no different from implementing a trait for both `Vec<T>` and `VecDeque<T>`.
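The coherence argument can be tried out on stable Rust with stand-in types. In this sketch, `U<N>` and `I<N>` are hypothetical wrapper structs that play the role of the proposed `u<N>` and `i<N>`, and `MyTrait` with its `describe` method is an invented example trait; none of these names come from the RFC.

```rust
// Hypothetical stand-ins for the proposed built-in types `u<N>` and `i<N>`.
pub struct U<const N: u32>(pub u128);
pub struct I<const N: u32>(pub i128);

pub trait MyTrait {
    fn describe(&self) -> String;
}

// Two impls over distinct generic types: coherence accepts both, just as it
// would for `Vec<T>` and `VecDeque<T>`.
impl<const N: u32> MyTrait for U<N> {
    fn describe(&self) -> String {
        format!("unsigned, {} bits, value {}", N, self.0)
    }
}

impl<const N: u32> MyTrait for I<N> {
    fn describe(&self) -> String {
        format!("signed, {} bits, value {}", N, self.0)
    }
}

// By contrast, two blanket impls such as
//   impl<T: UnsignedInteger> MyTrait for T { … }
//   impl<T: SignedInteger> MyTrait for T { … }
// are rejected as conflicting, because a type could implement both traits.

fn main() {
    println!("{}", U::<7>(5).describe());
    println!("{}", I::<7>(-5).describe());
}
```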
I think this would likely go in the motivation section rather than the rationale section, but I agree with you that this is a good argument to mention. Will have to ponder where exactly it fits in the RFC.
Also due to my design decision to base everything on simple types (no arrays), the maximum number of bits supported is u127.
I hadn't actually read the code yet, but I'm actually a bit curious why the max number of bits is 127 instead of 128. This feels like a weird restriction.
It is 128 bits actually.
By the way, I love this RFC! While arbitrary-int (as well as ux) provide the unusually-sized ints like u48 etc, having a built-in solution will feel more natural and allow numbers to be treated in a much more unified fashion, which I'm looking forward to.
Summary
Adds the builtin types `u<N>` and `i<N>`, allowing integers with an arbitrary size in bits.
Rendered
Details
This is a follow-up to #2581, which was previously postponed. A lot has happened since then, and there has been general support for this change from a lot of different people. It's time.
There are a few key differences from the previous RFC, but I trust that you can read.
Thanks
Thank you to everyone who responded to the pre-RFC on Internals with feedback.