[API Proposal]: string.GetHashCodeNonRandomized #77679
Comments
Tagging subscribers to this area: @dotnet/area-system-runtime
Issue Details
Background and motivation
In .NET Core, string hash codes are always randomized. This is critical to avoid certain kinds of attacks when adding arbitrary, untrusted inputs into types like Dictionary<,> and HashSet<>.
API Proposal
namespace System
{
public sealed class String
{
public static int GetHashCode(ReadOnlySpan<char> value);
public static int GetHashCode(ReadOnlySpan<char> value, StringComparison comparisonType);
+ public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value);
+ public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value, StringComparison comparisonType);
}
}
API Usage
int hashcode = string.GetHashCodeNonRandomized(value, StringComparison.OrdinalIgnoreCase);
Alternative Designs
We could instead or in addition expose StringComparer singletons:
namespace System
{
public abstract class StringComparer
{
public static StringComparer Ordinal { get; }
public static StringComparer OrdinalIgnoreCase { get; }
+ public static StringComparer OrdinalNonRandomized { get; }
+ public static StringComparer OrdinalIgnoreCaseNonRandomized { get; }
}
}
If we did that instead of the proposed APIs, we should also consider adding Equals/GetHashCode overloads for ReadOnlySpan<char> to StringComparer (something we might want to do anyway as part of #27229).
Risks
A risk could be developers defaulting to using these instead of the randomized implementations in situations where the randomized implementations are warranted. However, some developers are already writing their own hash implementations to avoid the randomized overhead, and their implementations may be worse or less efficient than what's already in the box.
|
I'd rather see it as new |
The proposal has them as static methods, not instance. (Also as noted, we'd need support for |
|
From reading the source code, it seems that the randomized comparer calls an entirely different algorithm. Can you educate me on why that is? Why does it not call the same algorithm just with a different seed?
Lines 58 to 72 in 40df8f6
Lines 47 to 53 in 40df8f6
runtime/src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs Lines 821 to 851 in 40df8f6
It seems that the non-randomized version is not seeded at all and will therefore produce the same output across process restarts. This could lead people to take a dependency on the concrete algorithm (although they should not be doing that). This could be avoided by XORing the result with a per-process constant. The performance cost of that should be very small. |
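As an illustration of the XOR suggestion in the comment above, here is a minimal sketch. It assumes a stand-in hash (an FNV-1a variant over UTF-16 code units), which is not the runtime's internal algorithm. The names SeededHashSketch, Unseeded, Seeded, and ProcessSeed are hypothetical and not part of any proposed API.

```csharp
using System;
using System.Security.Cryptography;

public static class SeededHashSketch
{
    // Picked once per process: values are stable within a run but differ between runs.
    private static readonly int ProcessSeed =
        RandomNumberGenerator.GetInt32(int.MinValue, int.MaxValue);

    // Stand-in non-randomized hash: an FNV-1a variant over UTF-16 code units.
    // Assumption for illustration only; NOT the algorithm the runtime uses internally.
    public static int Unseeded(ReadOnlySpan<char> value)
    {
        unchecked
        {
            uint hash = 2166136261;
            foreach (char c in value)
            {
                hash = (hash ^ c) * 16777619;
            }
            return (int)hash;
        }
    }

    // The idea from the comment: XOR the result with a per-process constant.
    public static int Seeded(ReadOnlySpan<char> value) => Unseeded(value) ^ ProcessSeed;
}
```

Note that XORing with a constant only hides the concrete values. Two inputs collide under Seeded exactly when they collide under Unseeded, so this discourages callers from persisting hash codes but adds no protection against deliberately colliding inputs.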
Isn't that why it's called "NonRandomized"? 🙂
See the "Background and motivation" section of this proposal. The Marvin32 hash code is good, but it's obviously slower than a simpler non-randomized implementation. |
@Nuklon I think it's better this way |
Personally I would vote for option 2, a new comparer, because on the String class it feels like it's kinda leaking implementation details. String methods are usually "generic", like Contains, StartsWith, etc., and this one is just way too specific to be on the String class. Also, I personally wouldn't want newcomers to find this sooner than they should :) This kind of thing is more or less aimed at library authors, I guess. |
When I investigated the new |
No. It is public in CoreLib just to provide binary formatter compatibility as the comment says. It is not part of the public .NET APIs. |
Background and motivation
In .NET Core, string hash codes are always randomized. This is critical to avoid certain kinds of attacks when adding arbitrary, untrusted inputs into types like Dictionary<,> and HashSet<>. However, for situations where the inputs are trusted, the overhead of these randomized hash codes makes them measurably more expensive than their non-randomized counterparts. As such, Dictionary<,> and HashSet<> both start out with non-randomized hash codes and only upgrade to randomized ones when enough collisions are detected. Such a capability is valuable for other collection types as well, but the raw primitives (the non-randomized hash code implementations) aren't trivial to implement efficiently and aren't exposed.
API Proposal
API Usage
Alternative Designs
We could instead or in addition expose StringComparer singletons:
If we did that instead of the proposed APIs, we should also consider adding Equals/GetHashCode overloads for ReadOnlySpan<char> to StringComparer (something we might want to do anyway as part of #27229).
Risks
A risk could be developers defaulting to using these instead of the randomized implementations in situations where the randomized implementations are warranted. However, some developers are already writing their own hash implementations to avoid the randomized overhead, and their implementations may be worse or less efficient than what's already in the box.
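For readers unfamiliar with the fallback described under "Background and motivation" (start with a non-randomized hash, upgrade once enough collisions are seen), the following is a rough sketch of the pattern for a custom collection. Everything here is an assumption for illustration: AdaptiveStringSet, CheapHash, and the threshold of 8 are hypothetical, the 31 * h + c hash is a stand-in rather than the runtime's algorithm, and the fixed bucket count omits resizing. It is not how Dictionary<,> or HashSet<> are actually implemented.

```csharp
using System;
using System.Collections.Generic;

public sealed class AdaptiveStringSet
{
    // Assumed threshold; the built-in collections use their own internal heuristics.
    private const int CollisionThreshold = 8;

    // Fixed bucket count to keep the sketch short (no resizing).
    private readonly List<(int Hash, string Value)>[] _buckets =
        new List<(int Hash, string Value)>[64];

    public bool IsRandomized { get; private set; }
    public int Count { get; private set; }

    // Stand-in non-randomized hash (31 * h + c); NOT the runtime's internal algorithm.
    public static int CheapHash(string value)
    {
        unchecked
        {
            int hash = 0;
            foreach (char c in value) hash = 31 * hash + c;
            return hash;
        }
    }

    // string.GetHashCode() is the randomized hash on .NET Core.
    private int Hash(string value) => IsRandomized ? value.GetHashCode() : CheapHash(value);

    public bool Add(string value)
    {
        int hash = Hash(value);
        var bucket = GetBucket(hash);
        int sameHash = 0;
        foreach (var entry in bucket)
        {
            if (entry.Hash != hash) continue;
            if (string.Equals(entry.Value, value, StringComparison.Ordinal)) return false;
            sameHash++; // a different string with an identical hash code
        }
        bucket.Add((hash, value));
        Count++;
        if (!IsRandomized && sameHash >= CollisionThreshold) SwitchToRandomized();
        return true;
    }

    public bool Contains(string value)
    {
        int hash = Hash(value);
        foreach (var entry in GetBucket(hash))
        {
            if (entry.Hash == hash && string.Equals(entry.Value, value, StringComparison.Ordinal))
                return true;
        }
        return false;
    }

    private List<(int Hash, string Value)> GetBucket(int hash)
    {
        int index = (int)(unchecked((uint)hash) % (uint)_buckets.Length);
        return _buckets[index] ??= new List<(int Hash, string Value)>();
    }

    // Re-inserts every element using the randomized hash.
    private void SwitchToRandomized()
    {
        var values = new List<string>();
        foreach (var bucket in _buckets)
        {
            if (bucket == null) continue;
            foreach (var entry in bucket) values.Add(entry.Value);
        }
        Array.Clear(_buckets, 0, _buckets.Length);
        IsRandomized = true;
        foreach (string value in values)
        {
            int hash = Hash(value);
            GetBucket(hash).Add((hash, value));
        }
    }
}
```

With trusted inputs the set stays on the cheap hash. Feeding it strings built from the blocks "Aa" and "BB", which collide under 31 * h + c, pushes it over the threshold and triggers the switch. A hand-rolled hash like CheapHash is also the kind of implementation the Risks paragraph warns may be worse than what is already in the box.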