Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Firefox crashes on Windows because BCryptGenRandom fails #94098

Closed
msmania opened this issue Feb 17, 2022 · 6 comments
Closed

Firefox crashes on Windows because BCryptGenRandom fails #94098

msmania opened this issue Feb 17, 2022 · 6 comments
Labels
C-bug Category: This is a bug. E-help-wanted Call for participation: Help is requested to fix this issue. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.

Comments

@msmania
Copy link

msmania commented Feb 17, 2022

Hi, I'm an engineer working for Mozilla on Firefox. We recently noticed we'd been getting crash reports indicating Firefox crashes during env_logger initialization. Upon debugging, we found out it was caused by a failure of BCryptGenRandom, that was introduced by #84096.

Almost all reports are from Windows 7 32-bit. We don't know the reason why BCryptGenRandom failed. Possibly the cryptography settings in the registry was corrupted and the process couldn't find the module.

To mitigate this crash, can you add a fallback logic, using RtlGenRandom if BCryptGenRandom fails for whatever reason. Probably we can limit it to Win7.

Here's the link to the tracking bug in Bugzilla.

https://bugzilla.mozilla.org/show_bug.cgi?id=1754490

@msmania msmania added the C-bug Category: This is a bug. label Feb 17, 2022
@the8472 the8472 added O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Feb 19, 2022
@yaahc yaahc added E-help-wanted Call for participation: Help is requested to fix this issue. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. labels Mar 3, 2022
@newpavlov
Copy link
Contributor

newpavlov commented Mar 3, 2022

Are you sure that RtlGenRandom will work in the case of BCryptGenRandom failure? I don't think it's correct to introduce such workaround hacks into std, since it's clearly an OS problem.

IIUC RtlGenRandom is deprecated in favor of BCryptGenRandom and ideally should not be used outside of Windows XP. Relevant discussion can be found in: rust-random/getrandom#65

One potential solution on the Firefox side is to try to call getrandom and check whether it returns an error. If it does, you could display an error window to user. Entropy retrieval in std and getrandom use nearly identical code, so if the latter has failed, then the former will probably fail as well.

Also note that we can not target only Win7 without affecting other Windows versions using only cfg gating, since Windows version is not part of target triplet.

@Milo123459
Copy link
Contributor

Also note that we can not target only Win7 without affecting other Windows versions using only cfg gating, since Windows version is not part of target triplet.

Is there not some manual way we could check inside of the cfg block?

@marti4d
Copy link
Contributor

marti4d commented May 10, 2022

Hi there! Sorry about the radio silence for a while on this issue -- The OP isn't able to continue working on this, and I've been asked to help resolve it.

I have added PR #96917 that implements the idea of re-introducing RtlGenRandom() as a fallback API in the (very rare) case that BCryptGenRandom() doesn't work on a machine.

Are you sure that RtlGenRandom will work in the case of BCryptGenRandom failure? I don't think it's correct to introduce such workaround hacks into std, since it's clearly an OS problem.

We are very certain of this, as we don't see equivalent crashes in Firefox in our own use of RtlGenRandom, and I don't see any issue reported for Chrome/Edge in their code

We are not convinced that it is an "OS problem", persay... We suspect that a virus scanner or other security software may be interfering with certain machines' ability to load the BCrypt library. We don't even know for certain that this issue only affects Firefox; it could just be that we're the only ones with beefy enough crash reporting to catch it.

I agree with you in principle that it's not great to have to introduce workarounds for things like "a virus scanner is breaking Rust code" into the standard library, but if we assume that Firefox is not the only piece of software that has this issue then where else should the change be made? Even if it is only Firefox, we're left with the options of either "seeing through" the abstraction that std offers and testing with getrandom (only to still crash, ultimately), continuing to crash, or forking std

The PR I have attached should make no difference in the 99.999%+ machines where this is not a problem, but in that small fraction of machines where BCrypt is broken, this only improves the robustness of StdLib.

Also note that we can not target only Win7 without affecting other Windows versions using only cfg gating, since Windows version is not part of target triplet.

As it turns out, the name of this issue is perhaps a bit wrong. This issue is seen on multiple versions of Windows (example crash on Windows 10). My PR is agnostic to this anyway -- if BCryptGenRandom() fails for whatever reason, it will then use RtlGenRandom() for the rest of the runtime.

@ChrisDenton ChrisDenton changed the title Firefox crashes on Windows 7 x86 because BCryptGenRandom fails Firefox crashes on Windows because BCryptGenRandom fails May 10, 2022
@ChrisDenton
Copy link
Member

Thank you for following up on this issue. As you say, we are reliant on Firefox's data here. I've updated the title to better reflect that not just x86 Windows 7 is affected.

Dylan-DPC added a commit to Dylan-DPC/rust that referenced this issue May 17, 2022
Make HashMap fall back to RtlGenRandom if BCryptGenRandom fails

With PR rust-lang#84096, Rust `std::collections::hash_map::RandomState` changed from using `RtlGenRandom()` ([msdn](https://docs.microsoft.com/en-us/windows/win32/api/ntsecapi/nf-ntsecapi-rtlgenrandom)) to `BCryptGenRandom()` ([msdn](https://docs.microsoft.com/en-us/windows/win32/api/bcrypt/nf-bcrypt-bcryptgenrandom)) as its source of secure randomness after much discussion ([here](rust-random/getrandom#65 (comment)), among other places).

Unfortunately, after that PR landed, Mozilla Firefox started experiencing fairly-rare crashes during startup while attempting to initialize the `env_logger` crate. ([docs for env_logger](https://docs.rs/env_logger/latest/env_logger/)) The root issue is that on some machines, `BCryptGenRandom()` will fail with an `Access is denied. (os error 5)` error message. ([Bugzilla issue 1754490](https://bugzilla.mozilla.org/show_bug.cgi?id=1754490)) (Discussion in issue rust-lang#94098)

Note that this is happening upon startup of Firefox's unsandboxed Main Process, so this behavior is different and separate from previous issues ([like this](https://bugzilla.mozilla.org/show_bug.cgi?id=1746254)) where BCrypt DLLs were blocked by process sandboxing. In the case of sandboxing, we knew we were doing something abnormal and expected that we'd have to resort to abnormal measures to make it work.

However, in this case we are in a regular unsandboxed process just trying to initialize `env_logger` and getting a panic. We suspect that this may be caused by a virus scanner or some other security software blocking the loading of the BCrypt DLLs, but we're not completely sure as we haven't been able to replicate locally.

It is also possible that Firefox is not the only software affected by this; we just may be one of the pieces of Rust software that has the telemetry and crash reporting necessary to catch it.

I have read some of the historical discussion around using `BCryptGenRandom()` in Rust code, and I respect the decision that was made and agree that it was a good course of action, so I'm not trying to open a discussion about a return to `RtlGenRandom()`. Instead, I'd like to suggest that perhaps we use `RtlGenRandom()` as a "fallback RNG" in the case that BCrypt doesn't work.

This pull request implements this fallback behavior. I believe this would improve the robustness of this essential data structure within the standard library, and I see only 2 potential drawbacks:

1. Slight added overhead: It should be quite minimal though. The first call to `sys::rand::hashmap_random_keys()` will incur a bit of initialization overhead, and every call after will incur roughly 2 non-atomic global reads and 2 easily predictable branches. Both should be negligible compared to the actual cost of generating secure random numbers
2. `RtlGenRandom()` is deprecated by Microsoft: Technically true, but as mentioned in [this comment on GoLang](golang/go#33542 (comment)), this API is ubiquitous in Windows software and actually removing it would break lots of things. Also, Firefox uses it already in [our C++ code](https://searchfox.org/mozilla-central/rev/5f88c1d6977e03e22d3420d0cdf8ad0113c2eb31/mfbt/RandomNum.cpp#25), and [Chromium uses it in their code as well](https://source.chromium.org/chromium/chromium/src/+/main:base/rand_util_win.cc) (which transitively means that Microsoft uses it in their own web browser, Edge). If there did come a time when Microsoft truly removes this API, it should be easy enough for Rust to simply remove the fallback in the code I've added here
Dylan-DPC added a commit to Dylan-DPC/rust that referenced this issue May 18, 2022
Make HashMap fall back to RtlGenRandom if BCryptGenRandom fails

With PR rust-lang#84096, Rust `std::collections::hash_map::RandomState` changed from using `RtlGenRandom()` ([msdn](https://docs.microsoft.com/en-us/windows/win32/api/ntsecapi/nf-ntsecapi-rtlgenrandom)) to `BCryptGenRandom()` ([msdn](https://docs.microsoft.com/en-us/windows/win32/api/bcrypt/nf-bcrypt-bcryptgenrandom)) as its source of secure randomness after much discussion ([here](rust-random/getrandom#65 (comment)), among other places).

Unfortunately, after that PR landed, Mozilla Firefox started experiencing fairly-rare crashes during startup while attempting to initialize the `env_logger` crate. ([docs for env_logger](https://docs.rs/env_logger/latest/env_logger/)) The root issue is that on some machines, `BCryptGenRandom()` will fail with an `Access is denied. (os error 5)` error message. ([Bugzilla issue 1754490](https://bugzilla.mozilla.org/show_bug.cgi?id=1754490)) (Discussion in issue rust-lang#94098)

Note that this is happening upon startup of Firefox's unsandboxed Main Process, so this behavior is different and separate from previous issues ([like this](https://bugzilla.mozilla.org/show_bug.cgi?id=1746254)) where BCrypt DLLs were blocked by process sandboxing. In the case of sandboxing, we knew we were doing something abnormal and expected that we'd have to resort to abnormal measures to make it work.

However, in this case we are in a regular unsandboxed process just trying to initialize `env_logger` and getting a panic. We suspect that this may be caused by a virus scanner or some other security software blocking the loading of the BCrypt DLLs, but we're not completely sure as we haven't been able to replicate locally.

It is also possible that Firefox is not the only software affected by this; we just may be one of the pieces of Rust software that has the telemetry and crash reporting necessary to catch it.

I have read some of the historical discussion around using `BCryptGenRandom()` in Rust code, and I respect the decision that was made and agree that it was a good course of action, so I'm not trying to open a discussion about a return to `RtlGenRandom()`. Instead, I'd like to suggest that perhaps we use `RtlGenRandom()` as a "fallback RNG" in the case that BCrypt doesn't work.

This pull request implements this fallback behavior. I believe this would improve the robustness of this essential data structure within the standard library, and I see only 2 potential drawbacks:

1. Slight added overhead: It should be quite minimal though. The first call to `sys::rand::hashmap_random_keys()` will incur a bit of initialization overhead, and every call after will incur roughly 2 non-atomic global reads and 2 easily predictable branches. Both should be negligible compared to the actual cost of generating secure random numbers
2. `RtlGenRandom()` is deprecated by Microsoft: Technically true, but as mentioned in [this comment on GoLang](golang/go#33542 (comment)), this API is ubiquitous in Windows software and actually removing it would break lots of things. Also, Firefox uses it already in [our C++ code](https://searchfox.org/mozilla-central/rev/5f88c1d6977e03e22d3420d0cdf8ad0113c2eb31/mfbt/RandomNum.cpp#25), and [Chromium uses it in their code as well](https://source.chromium.org/chromium/chromium/src/+/main:base/rand_util_win.cc) (which transitively means that Microsoft uses it in their own web browser, Edge). If there did come a time when Microsoft truly removes this API, it should be easy enough for Rust to simply remove the fallback in the code I've added here
@bors bors closed this as completed in 0c92519 May 18, 2022
@nagisa
Copy link
Member

nagisa commented Jul 16, 2022

We are not convinced that it is an "OS problem", persay... We suspect that a virus scanner or other security software may be interfering with certain machines' ability to load the BCrypt library. We don't even know for certain that this issue only affects Firefox; it could just be that we're the only ones with beefy enough crash reporting to catch it.

I agree with you in principle that it's not great to have to introduce workarounds for things like "a virus scanner is breaking Rust code" into the standard library, but if we assume that Firefox is not the only piece of software that has this issue then where else should the change be made?

If it is indeed caused by essential snake oils, then I’d argue that Firefox as an application has just large enough of a profile to get the vendors to fix their software. You all aren’t introducing workarounds to enable Firefox’s operation on systems with random memory bit-flips caused by unstable memory, are you? I would argue a similar line of thinking could apply here to at least some extent.

@marti4d
Copy link
Contributor

marti4d commented Jul 18, 2022

We are not convinced that it is an "OS problem", persay... We suspect that a virus scanner or other security software may be interfering with certain machines' ability to load the BCrypt library. We don't even know for certain that this issue only affects Firefox; it could just be that we're the only ones with beefy enough crash reporting to catch it.
I agree with you in principle that it's not great to have to introduce workarounds for things like "a virus scanner is breaking Rust code" into the standard library, but if we assume that Firefox is not the only piece of software that has this issue then where else should the change be made?

If it is indeed caused by essential snake oils, then I’d argue that Firefox as an application has just large enough of a profile to get the vendors to fix their software. You all aren’t introducing workarounds to enable Firefox’s operation on systems with random memory bit-flips caused by unstable memory, are you? I would argue a similar line of thinking could apply here to at least some extent.

FWIW I agree, and we do generally try to get vendors to fix their software: we've worked with many companies like Symantec, AVG, McAfee, etc to fix their products, and I completely agree that this is preferable because it improves the software ecosystem for everyone.

In this case, we were unable to link the BCrypt issue to a specific source; it's still in our queue to investigate it, and once we do I could foresee a future where we push another change to Rust Std to undo the thing I did. In the meantime, someone in the chain has to unfortunately workaround this issue. Given that HashMap is an essential part of Rust Std, we assume that "almost every Rust program crashes on this small set of machines" is not really the kind of optics that the Rust community wants.

It would be nice, of course, if all vendors just followed the APIs they were given properly, but I doubt you'll find any popular language's standard library that doesn't contain a pile of workarounds for this kind of stuff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. E-help-wanted Call for participation: Help is requested to fix this issue. E-medium Call for participation: Medium difficulty. Experience needed to fix: Intermediate. O-windows Operating system: Windows T-libs Relevant to the library team, which will review and decide on the PR/issue.
Projects
None yet
Development

No branches or pull requests

8 participants