Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impl Distribution<u8> for Alphanumeric #935

Closed
wants to merge 1 commit into from
Closed

Conversation

qoh
Copy link
Contributor

@qoh qoh commented Jan 21, 2020

Sampling a random alphanumeric string by collecting chars (that are known to be ASCII) into a String involves emitting code for re-allocation as String is encoding to UTF-8, via the example:

let chars: String = iter::repeat(())
        .map(|()| rng.sample(Alphanumeric))
        .take(7)
        .collect();

I wanted to get rid of the clearly unnecessary re-allocation branches in my application, so I needed to be able to access to the ASCII characters as simple bytes. It seems like that was already what was going on inside Alphanumeric however, it was just internal.

This PR changes the Distribution<char> impl to provide u8s (which it generated internally already) instead, and implements the previous Distribution<char> using it. One could then for example do:

let mut rng = thread_rng();
let bytes = (0..7).map(|_| rng.sample(Alphanumeric)).collect();
let chars = unsafe { String::from_utf8_unchecked(bytes) };

Sampling a random alphanumeric string by collecting chars (that are known to be ASCII) into a String involves re-allocation as String is encoding to UTF-8, via the example:

```rust
let chars: String = iter::repeat(())
        .map(|()| rng.sample(Alphanumeric))
        .take(7)
        .collect();
```

I wanted to get rid of the clearly unnecessary re-allocations in my applications, so I needed to be able to access to the ASCII characters as simple bytes. It seems like that was already what was going on inside Alphanumeric however, it was just internal.

This PR changes the `Distribution<char>` impl to provide `u8`s (which it generates internally) instead, and implements the previous `Distribution<char>` using it. One could then, for example, do this:

```rust
let mut rng = thread_rng();
let bytes = (0..7).map(|_| rng.sample(ByteAlphanumeric)).collect();
let chars = unsafe { String::from_utf8_unchecked(bytes) };
```
@qoh
Copy link
Contributor Author

qoh commented Jan 21, 2020

Per the test failure this is however blocking, as let c = rng.sample(Alphanumeric) alone is no longer enough to infer the type of c. I'm not sure what the appropriate direction to go with regard to that is. Should the API change in that way, or should it be a separate ByteAlphanumeric type?

@dhardy
Copy link
Member

dhardy commented Jan 21, 2020

If there was a way to directly convert an ASCII char to u8, then the optimiser might be able to solve this problem for you (assuming it can prove the char is a u8); unfortunately it appears there isn't.

I guess you could test the following but it's throwing considerably more work at the optimiser:

let bytes = (0..7).map(|_| {
        let mut buf = [0u8; 4];
        let len = rng.sample(Alphanumeric).encode_utf8(&mut buf).len();
        debug_assert_eq!(len, 1);
        buf[0]
    ).collect();

@dhardy
Copy link
Member

dhardy commented Jan 21, 2020

You can use as for conversion, hence the following should do the trick:

let bytes = (0..7).map(|_| rng.sample(Alphanumeric) as u8).collect();

Can you test whether this is efficient enough for you? If so, there's no need to extend the API.

@qoh
Copy link
Contributor Author

qoh commented Jan 21, 2020

That would also work, though at least to me involving the mental/safety review overhead of [<u8> as char] as u8 needlessly does seem like a bit of a shame. I would appreciate the Distribution<u8> impl being available on either Alphanumeric or a separate ByteAlphanumeric, but it's totally understandable if you don't wish to extend/modify the API as such.

@dhardy
Copy link
Member

dhardy commented Jan 21, 2020

API design is a tradeoff between convenience, simplicity, completeness, minimality and avoiding breaking changes. Adding two APIs to do essentially the same thing which is already a tangential aspect of a much larger library may not be justified (based on a single request and apparently low usage of Alphanumeric).

@dhardy
Copy link
Member

dhardy commented Mar 9, 2020

Failing significant interest, I will close this with the recommendation that users needing this behaviour implement the distribution themselves or a workaround like above.

This PR will remain open for comment until closer to the 0.8 release.

@vks
Copy link
Collaborator

vks commented Aug 1, 2020

I think it might make sense to make Alphanumeric implement Distribution<u8>, since u8 is generated internally and can be easily converted to char via From and Into. This would be in line with our recent decision to make Poisson implement Distribution<f64> instead of Distribution<u64>.

@dhardy
Copy link
Member

dhardy commented Aug 1, 2020

It's not really equivalent though, is it, @vks? Numeric distributions are usually able to deduce their type from arguments, Poisson included.

There isn't a strong argument either way in this case, so either resolution is acceptable IMO.

@abonander
Copy link

based on a single request and apparently low usage of Alphanumeric

@dhardy for future reference, a Github search probably isn't sufficient to determine usage of an API, although I don't really have an alternative to offer here. The dropping of impl Distribution<char> for Alphanumeric was mildly annoying when upgrading rand on a couple proprietary projects. It's pretty useful to be able to easily generate a random string for things like Nonces in OAuth 1.0 requests or session IDs for a web application.

Granted the fix is pretty easy, although not immediately obvious. I wasn't actually aware you could cast a u8 to char or that there was impl From<u8> for char. I initially thought I'd have to collect to a Vec<u8> and then use String::from_utf8() which I wasn't particularly excited about. I had to find #1012 to notice the change to the documentation examples and realize I could just do .map(char::from).

@dhardy
Copy link
Member

dhardy commented May 19, 2021

@abonander sorry to hear it wasn't so clear how to use the new Alphanumeric. Do you have suggestions of how to improve the documentation? If so it would be better to open a new issue (and refer to this one) so we can track it.

I guess you missed the example in the docs?

@vks
Copy link
Collaborator

vks commented May 19, 2021

We also mention it in our update guide. Maybe it's not visible enough?

@abonander
Copy link

abonander commented May 19, 2021

I guess you missed the example in the docs?

I honestly glanced past it. The heading could be a bit more eye-catching, such as: Example: Random Alphanumeric String

We also mention it in our update guide. Maybe it's not visible enough?

I wasn't aware that existed, actually. I see it's mentioned in the README but it's kind of buried in a wall of text.

The change to the impl isn't really a satisfying solution to the initial problem though because it requires using unsafe { String::from_utf8_unchecked() } for optimal performance. .map(char::from) isn't really any different than the original Distribution<char> impl.

What would be really convenient, performant, and easy to enforce as secure, would be an inherent method on Alphanumeric:

impl Alphanumeric {
    pub fn gen_string(&self, rng: &mut impl CryptoRng, len: usize) -> String {
        // there's no real difference in implementation here
        // the different names and bounds are just to make the user think twice about which RNG they're using
        self.gen_string_insecure(rng, len)
    }

    /// An alternative to `gen_string` when the random string isn't being used for secure contexts, e.g. as random input for tests.
    pub fn gen_string_insecure(&self, rng: &mut impl Rng, len: usize) -> String {
        let bytes: Vec<u8> = self.sample_iter(rng).take(len).collect();

        if cfg!(debug_assertions) {
            // sanity-check the distribution in testing
            String::from_utf8(bytes).expect("BUG: Alphanumeric generates invalid bytes")
        } else {
            unsafe { String::from_utf8_unchecked(bytes) }
        }
    }
}

Projects could even use #[deny(clippy::disallowed_method)] to forbid using Alphanumeric::gen_string_insecure()

@Lucretiel
Copy link
Contributor

Out of curiosity, was there a specific reason to remove Distribution<char>? I understand and agree with the rationale for adding Distribution<u8>, but couldn't it simply implement both?

@dhardy
Copy link
Member

dhardy commented Aug 4, 2021

Type inference.

Since conversion to char is easy supporting only u8 seems like the best choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants