-
Notifications
You must be signed in to change notification settings - Fork 525
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Strings RFC #1848
Strings RFC #1848
Conversation
fn as_wide(&self) -> &[u16] {} | ||
// String data with the trailing 0 | ||
fn as_wide_with_nul(&self) -> &[u16] {} | ||
fn to_string(&self) -> Result<String, FromUtf16Error> {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: This function collides with ToString::to_string
, since HSTRING
implements Display
below.
Perhaps this could be renamed try_to_string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. That's a shame. We could use into_string
which mirrors CString::into_string
directly. I originally didn't do this as it's not exactly trivial to reuse the String
's buffer for the HSTRING
buffer and thus I didn't want to take self
.
|
||
/// The wide equivalent to std::ffi::CString; | ||
#[repr(transparent)] | ||
struct CWString(..); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should there be a matching borrowed CWStr
, like CString
and CStr
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps... I'm not sure how often this will be useful in practice though it might be. PCWSTR
would just be like a raw pointer so ideally the user would only rarely use it, but I'm unsure how often you'd want a &CWStr
and not a &CWString
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rylev
One reason would be so there isn't a double-indirection.
It can also be used to convert a PCWSTR
into a safely-used reference without having to copy or own the data. (See for example CStr::from_ptr
)
|
||
These function build null terminated string data of the appropriate width (`u8` in the case of `PCSTR` and `u16` in the other cases). | ||
|
||
**question**: for all these types built from static strings, could we special case them so that they don't use reference counting at all? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just special case for a string literal, then use that with concat!($str, "\0")
, etc. For wide strings, you'd probably need something like const_heap
(nightly api very far from stable) or compiler support.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This refers to whether we need a special version of HSTRING
and the like that doesn't track reference counts since its backing memory is statically allocated. Actually statically allocating enough memory for a wide string is already a solved problem (e.g., see here).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I didn't realise that the const on stable had got that far. EDIT: And I forgot that you could use it to get a length for a constant array...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So essentially, you'd get a wide string reference (PCWSTR
), then pass that to WindowsCreateStringReference
(in windows::Win32::System::WinRT
), then use the resultant HSTRING
structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using WindowsCreateStringReference
is generally a pessimization unless the calling language natively supports null terminated wide string literals, like C# and C++. For languages like Rust, that's not worth it as the resultant HSTRING
typically ends up being copied almost immediately.
|
||
It's clear that "IN" params are the scenario where ownership differs from the status quo. Therefore, the `windows` crate will treat the types as the following. | ||
|
||
* Introduce a new type `CWString` which is the wide string equivalent to `CString`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll review more carefully when I get a moment but my most obvious question is what CWString
gives you that HSTRING
doesn't already provide. Obviously, there are already a dizzying number of string types and I'd love to not make that dizzier. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? The difference is that HSTRING
has the added complexity of being reference counted. CWString
would have the added benefit of being symmetrical with the Rust std library CString
type, but I agree that we want to try to limit the explosion of types if we can...
|
||
#### Windows | ||
|
||
Windows has the following string types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another way to look at this is that Windows has a variety of string types that need to be supported for interop with various APIs but that languages can and should only use them on the ABI boundaries as much as is practical. The extreme example of this is C# where only System.String
exists in C# and all the different string types are marshaled away behind the scenes. While I'm not suggesting we go quite that far, we should try to think of these as interop types needed only for the ABI boundary and put our energy into making it as efficient and seamless as possible to go from and get back to native and idiomatic Rust string types, and not try to make Windows replacements with all the bells and whistles of Rust's string types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For PCSTR
, we can just use *const c_char
directly, and CStr
provides from_ptr
and as_ptr
methods to convert to and from it. For PSTR
, we would need CString
and CStr
to allow mutability.
For PWSTR
and PCWSTR
, it would need roughly the same, except with wide characters. PCWSTR
(I think) should be able to be created from a BSTR
or HSTRING
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be the ideal situation.
However, the unfortunate situation with Rust is, that Rust does not provide a zero-overhead string type that's compatible with Windows' native string encoding. Vec<u16>
/&[u16]
is the best Rust has to offer in this respect. I don't see any way "to go from and get back to native and idiomatic Rust string types" at all.
Being pragmatic here, and assuming that unmatched surrogate pairs won't appear in practice is a dangerous route to go down. Doing so will open up code based on the windows
crate to attacks, even if only DoS attacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed that's the ideal hence the "as much as is practical". So we need a null terminated wide char but we need such things as little as humanly possible with the emphasis on interop rather than string features. And since we need HSTRING anyway and HSTRING is a null terminated wide char string that should cover those needs. Even if we end up calling it CWString, for symmetry, that just happens to be implemented as an HSTRING we can at least minimize the number of implementations we're carrying around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. I am a bit skeptical that we'll actually save much code by using an HSTRING
as a backing implementation detail, but if this is truly just an implementation detail then it doesn't really matter. We can keep playing with the implementation without impacting users.
|
||
* `PSTR`: a mutable, nul-terminated string (i.e., composed of "characters" a.k.a. `u8`s) that is often but not always ANSI encoded. | ||
* `PCSTR`: an immutable version of `PSTR`. | ||
* **QUESTION**: what are the practical differences between this and `PSTR`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is that PCSTR
can be created from a CStr
reference, and can also be converted back.
EDIT: That is because CStr
doesn't allow mutability.
I would also add that there are some more rough edges with functions that take even more variations on strings. There's UNICODE_STRING which is roughly equivalent to the Rust types Also there are functions such as And some functions such as While these should not be too difficult to support, I think it's worth noting there are a number of variations on the basic |
Right, which is why I'm reluctant to add some new synthetic string type like To @ChrisDenton's point, it is also not the goal of the |
```rust | ||
let x: CWString = w!("hello"); | ||
let y: CString = c!("hello"); | ||
let z: HSTRING = h!("hello"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need these macros to produce allocated strings. That requires runtime support. They should all be able to return const
string literals of PCWSTR
and PCSTR
respectively. Then they can simply be used to call functions like MessageBoxA/W
directly and efficiently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Common Windows string types like HSTRING
and BSTR
can then just provide From<&T>
implementations for PCWSTR
for cases where a string needs to be computed at runtime and in future if/when CString
and CWString
are available in the std library we can just provide From
implementations for those as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then we shouldn't need the Into<T>
below, unless I'm forgetting something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's what I have in mind: https://github.com/microsoft/windows-rs/compare/utf16?expand=1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And this is compatible with the existing state of the windows
crate so would make it easier to land the Borrowed
PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal is a subset of what I had in mind so I think it's fine to try it out. We can always expand from there to have dedicated CWString
types and such.
* `PSTR`: a mutable, nul-terminated string (i.e., composed of "characters" a.k.a. `u8`s) that is often but not always ANSI encoded. | ||
* `PCSTR`: an immutable version of `PSTR`. | ||
* **QUESTION**: what are the practical differences between this and `PSTR`? | ||
* `PWSTR`: a mutable, nul-terminated "wide" string (i.e., composed of "wide characters" a.k.a. `u16`s) that is often but not always UTF-16 encoded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[...] that is often but not always UTF-16 encoded.
Can you expand on this? I'm not aware of any non-UTF-16 encoded PWSTRs. Are you instead trying to convey that a grapheme may span 1 or 2 code 16-bit units (2-4 bytes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's possible to create invalid UTF-16 strings because the kernel does not check validity. This is exceedingly rare in practice but it is possible, especially if there's a malicious app causing mischief.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The prime candidate here is the filesystem. NTFS allows names to be just about any sequence of 16-bit values (with a few reserved code units). Those names will inevitably be returned from FindFirstFile
/FindNextFile
, so PWSTR
/PWCSTR
cannot make any assumptions about the validity of the UTF-16 encoding.
Likewise, named kernel objects (events, mutexes, pipes, ...) can have names that aren't valid UTF-16.
Not quite the norm, but invalid UTF-16 sequences are observed in the wild, usually as a result of an accident/bug, or, more frequently, in attacks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no use trying to define what these are. They're just pointer typedef
s, which is why windows-rs
defines them as simply as possible. We can provide safe ways to create them and pass them to API calls, but we cannot make any guarantees about them or provide generally safe conversions from them.
This PR's pretty old. Can we move it along or just close it for now? |
This is an RFC for how we'll handle the many types of Strings in Rust and Windows and how they interact with each other.
This is a complicated question, so I imagine we'll probably not get it right, but I do believe this is an improvement over the status quo not only in terms of useability but also safety.