[API Proposal]: string.GetHashCodeNonRandomized #77679

stephentoub · 2022-10-31T14:08:04Z

Background and motivation

In .NET Core, string hash codes are always randomized. This is critical to avoid certain kinds of attacks when adding arbitrary, untrusted inputs into types like Dictionary<,> and HashSet<>. However, for situations where the inputs are trusted, the overhead of these randomized hash codes makes them measurably more expensive than their non-randomized counterparts. As such, Dictionary<,> and HashSet<> both start out with non-randomized hash codes and only upgrade to randomized ones when enough collisions are detected. Such a capability is valuable for other collection types as well, but the raw primitives (the non-randomized hash code implementations) aren't trivial to implement efficiently and aren't exposed.

API Proposal

namespace System
{
    public sealed class String
    {
        public static int GetHashCode(ReadOnlySpan<char> value);
        public static int GetHashCode(ReadOnlySpan<char> value, StringComparison comparisonType);

+       public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value);
+       public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value, StringComparison comparisonType);
    }
}

API Usage

int hashcode = string.GetHashCodeNonRandomized(value, StringComparison.OrdinalIgnoreCase);

Alternative Designs

We could instead or in addition expose StringComparer singletons:

namespace System
{
    public abstract class StringComparer
    {
        public static StringComparer Ordinal { get; }
        public static StringComparer OrdinalIgnoreCase { get; }

+       public static StringComparer OrdinalNonRandomized { get; }
+       public static StringComparer OrdinalIgnoreCaseNonRandomized { get; }
    }
}

If we did that instead of the proposed APIs, we should also consider adding Equals/GetHashCode overloads for ReadOnlySpan<char> to StringComparer (something we might want to do anyway as part of #27229).

Risks

A risk could be developers defaulting to using these instead of the randomized implementations in situations where the randomized implementations are warranted. However, some developers are already writing their own hash implementations to avoid the randomized overhead, and their implementations may be worse or less efficient than what's already in the box.

The text was updated successfully, but these errors were encountered:

stephentoub · 2022-10-31T14:08:43Z

cc: @GrabYourPitchforks, @bartonjs, @terrajobst

ghost · 2022-10-31T14:09:02Z

Tagging subscribers to this area: @dotnet/area-system-runtime
See info in area-owners.md if you want to be subscribed.

Issue Details

Background and motivation

In .NET Core, string hash codes are always randomized. This is critical to avoid certain kinds of attacks when adding arbitrary, untrusted inputs into types like Dictionary<,> and HashSet<>. However, for situations where the inputs are trusted, the overhead of these randomized hash codes makes them measurably more expensive than their non-randomized counterparts. As such, Dictionary<,> and HashSet<> both start out with non-randomized hash codes and only upgrade to randomized ones when enough collisions are detected. Such a capability is valuable for other collection types as well, but the raw primitives (the non-randomized hash code implementations) aren't trivial to implement efficiently and aren't exposed.

API Proposal

namespace System
{
    public sealed class String
    {
        public static int GetHashCode(ReadOnlySpan<char> value);
        public static int GetHashCode(ReadOnlySpan<char> value, StringComparison comparisonType);

+       public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value);
+       public static int GetHashCodeNonRandomized(ReadOnlySpan<char> value, StringComparison comparisonType);
    }
}

API Usage

int hashcode = string.GetHashCodeNonRandomized(value, StringComparison.OrdinalIgnoreCase);

Alternative Designs

We could instead or in addition expose StringComparer singletons:

namespace System
{
    public abstract class StringComparer
    {
        public static StringComparer Ordinal { get; }
        public static StringComparer OrdinalIgnoreCase { get; }
+       public static StringComparer OrdinalNonRandomized { get; }
+       public static StringComparer OrdinalIgnoreCaseNonRandomized { get; }
    }
}

If we did that instead of the proposed APIs, we should also consider adding Equals/GetHashCode overloads for ReadOnlySpan<char> to StringComparer (something we might want to do anyway as part of #27229).

Risks

A risk could be developers defaulting to using these instead of the randomized implementations in situations where the randomized implementations are warranted. However, some developers are already writing their own hash implementations to avoid the randomized overhead, and their implementations may be worse or less efficient than what's already in the box.

Author:	stephentoub
Assignees:	-
Labels:	`api-suggestion`, `area-System.Runtime`
Milestone:	8.0.0

bartonjs · 2022-10-31T16:29:00Z

I'd rather see it as new StringComparers. It doesn't seem to me to be a universal enough need on every String to be worthy of an instance method.

stephentoub · 2022-10-31T16:30:53Z

It doesn't seem to me to be a universal enough need on every String to be worthy of an instance method.

The proposal has them as a static methods, not instance.

(Also as noted, we'd need support for ReadOnlySpan<char> inputs, so as just a StringComparer, we'd need to either augment StringComparer to also supports spans, or expose separate span-based methods as well.)

Nuklon · 2022-11-01T12:28:31Z

GetNonRandomizedHashCode instead? GetHashCodeNonRandomized is some odd English.

GSPP · 2022-11-02T11:21:47Z

From reading the source code, it seems that the randomized comparer calls an entirely different algorithm. Can you educate me on why that is? Why does it not call the same algorithm just with a different seed?

runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/RandomizedStringEqualityComparer.cs

Lines 58 to 72 in 40df8f6

    
               public override int GetHashCode(string? obj) 
        
               { 
        
                   if (obj is null) 
        
                   { 
        
                       return 0; 
        
                   } 
        
                   // The Ordinal version of Marvin32 operates over bytes. 
        
                   // The multiplication from # chars -> # bytes will never integer overflow. 
        
                   return Marvin.ComputeHash32( 
        
                       ref Unsafe.As<char, byte>(ref obj.GetRawStringData()), 
        
                       (uint)obj.Length * 2, 
        
                       _seed.p0, _seed.p1); 
        
               } 
        
           }

runtime/src/libraries/System.Private.CoreLib/src/System/Collections/Generic/NonRandomizedStringEqualityComparer.cs

Lines 47 to 53 in 40df8f6

    
           public virtual int GetHashCode(string? obj) 
        
           { 
        
               // This instance may have been deserialized into a class that doesn't guarantee 
        
               // these parameters are non-null. Can't short-circuit the null checks. 
        
               return obj?.GetNonRandomizedHashCode() ?? 0; 
        
           }

runtime/src/libraries/System.Private.CoreLib/src/System/String.Comparison.cs

Lines 821 to 851 in 40df8f6

    
           internal unsafe int GetNonRandomizedHashCode() 
        
           { 
        
               fixed (char* src = &_firstChar) 
        
               { 
        
                   Debug.Assert(src[this.Length] == '\0', "src[this.Length] == '\\0'"); 
        
                   Debug.Assert(((int)src) % 4 == 0, "Managed string should start at 4 bytes boundary"); 
        
                   uint hash1 = (5381 << 16) + 5381; 
        
                   uint hash2 = hash1; 
        
                   uint* ptr = (uint*)src; 
        
                   int length = this.Length; 
        
                   while (length > 2) 
        
                   { 
        
                       length -= 4; 
        
                       // Where length is 4n-1 (e.g. 3,7,11,15,19) this additionally consumes the null terminator 
        
                       hash1 = (BitOperations.RotateLeft(hash1, 5) + hash1) ^ ptr[0]; 
        
                       hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[1]; 
        
                       ptr += 2; 
        
                   } 
        
                   if (length > 0) 
        
                   { 
        
                       // Where length is 4n-3 (e.g. 1,5,9,13,17) this additionally consumes the null terminator 
        
                       hash2 = (BitOperations.RotateLeft(hash2, 5) + hash2) ^ ptr[0]; 
        
                   } 
        
                   return (int)(hash1 + (hash2 * 1566083941)); 
        
               } 
        
           }

It seems that the non-randomized version is not seeded at all and will therefore result in the same output across process restarts. This could lead people to take a dependency on the concrete algorithm (although they should not be doing that). This could be avoided by xoring the result with a per-process constant. The performance cost of that should be very small.

EgorBo · 2022-11-02T11:29:17Z

It seems that the non-randomized version is not seeded at all and will therefore result in the same output across process restarts. This could be avoided by xoring the result with a per-process constant. The performance cost of that should be very small.

Isn't that why it's called "NonRandomized" ? 🙂

From reading the source code, it seems that the randomized comparer calls an entirely different algorithm. Can you educate me on why that is?

See "Background and motivation" section of this proposal, Marvin32 hash code is good, but it's obviously slower than a simpler non-randomized implementation.

ArminShoeibi · 2022-11-30T19:04:32Z

GetNonRandomizedHashCode instead? GetHashCodeNonRandomized is some odd English.

@Nuklon I think it's better this way GetHashCodeNonRandomized because of Intellisense and ordering.

En3Tho · 2022-11-30T19:22:36Z

Personally I would vote for option 2 - a new comparer as on String class it feels like it's kinda leaking implementation details. String methods are usually "generic" like Contains, StartsWith etc. And this one is just way to specific to be on String class. Also, I wouldn't personally want newcomers to find this faster than they should :) This kind of thing is more or less oriented on library authors I guess.

koszeggy · 2024-08-28T09:28:02Z

When I investigated the new IAlternateEqualityComparer<,> interface I noticed that now there is a public NonRandomizedStringEqualityComparer class in the code base but for some reason I cannot access it in Preview 7. Will it be available in .NET 9? But it would be equally alright if the proposed StringComparer.OrdinalNonRandomized/OrdinalIgnoreCaseNonRandomized properties exposed it.

jkotas · 2024-08-28T14:29:36Z

Will it be available in .NET 9?

No. It is public in CoreLib just to provide binary formatter compatibility as the comment says. It is not part of the public .NET APIs.

stephentoub added api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Runtime labels Oct 31, 2022

stephentoub added this to the 8.0.0 milestone Oct 31, 2022

tarekgh mentioned this issue Nov 22, 2022

Libraries work to support Cloud Native effort for .NET 8 #78518

Closed

61 tasks

dakersnar mentioned this issue Nov 30, 2022

System.Runtime work planned for .NET 8 #79053

Closed

31 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[API Proposal]: string.GetHashCodeNonRandomized #77679

[API Proposal]: string.GetHashCodeNonRandomized #77679

stephentoub commented Oct 31, 2022 •

edited

Loading

stephentoub commented Oct 31, 2022 •

edited

Loading

ghost commented Oct 31, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

bartonjs commented Oct 31, 2022

stephentoub commented Oct 31, 2022

Nuklon commented Nov 1, 2022

GSPP commented Nov 2, 2022

EgorBo commented Nov 2, 2022

ArminShoeibi commented Nov 30, 2022 •

edited

Loading

En3Tho commented Nov 30, 2022

koszeggy commented Aug 28, 2024

jkotas commented Aug 28, 2024

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

[API Proposal]: string.GetHashCodeNonRandomized #77679

[API Proposal]: string.GetHashCodeNonRandomized #77679

Comments

stephentoub commented Oct 31, 2022 • edited Loading

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

stephentoub commented Oct 31, 2022 • edited Loading

ghost commented Oct 31, 2022

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

bartonjs commented Oct 31, 2022

stephentoub commented Oct 31, 2022

Nuklon commented Nov 1, 2022

GSPP commented Nov 2, 2022

EgorBo commented Nov 2, 2022

ArminShoeibi commented Nov 30, 2022 • edited Loading

En3Tho commented Nov 30, 2022

koszeggy commented Aug 28, 2024

jkotas commented Aug 28, 2024

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

This comment was marked as off-topic.

stephentoub commented Oct 31, 2022 •

edited

Loading

stephentoub commented Oct 31, 2022 •

edited

Loading

ArminShoeibi commented Nov 30, 2022 •

edited

Loading