default string hash produces a lot of consecutive ints for (gensym) symbols #1520

ianthehenry · 2024-11-17T05:57:47Z

While debugging #1519 I noticed that the symbol cache is very densely packed with gensyms, which made the hash collision that breaks symbol/slice far more likely than it "should" be. (The application that actually triggered that bug had to traverse 4072 full buckets during the lookup in question, even though I think the resizing tries to keep the cache only half full.)

repl:1:> (hash (gensym))
135496779
repl:2:> (hash (gensym))
135496780
repl:3:> (hash (gensym))
135496781
repl:4:> (hash (gensym))
135496782
repl:5:> (hash (gensym))
135496773
repl:6:> (hash (gensym))
135496774

A cheap workaround for this particular issue would be to hash strings in reverse order, which seems worth it given how much of the symbol cache consists of gensyms during compilation, but I didn't want to do that in case someone proposes an altogether better hash function.
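To illustrate why the hashes come out consecutive, and why reversing would help, here is a minimal Python sketch. It assumes the string hash is a DJB-style multiply-and-add (seed 5381, multiplier 33, truncated to 32 bits). The gensym-like names and both functions are made up for illustration and are not Janet's actual code.

```python
MASK32 = 0xFFFFFFFF  # keep results in 32-bit range, like a C uint32_t

def djb2(data: bytes) -> int:
    """DJB-style hash: h = h * 33 + byte, truncated to 32 bits."""
    h = 5381
    for b in data:
        h = (h * 33 + b) & MASK32
    return h

def djb2_reversed(data: bytes) -> int:
    """Same hash, but consuming the bytes from last to first."""
    return djb2(data[::-1])

# Hypothetical gensym-like names that differ only in their final byte.
names = [f"_00000{i}".encode() for i in range(1, 5)]

forward = [djb2(n) for n in names]
backward = [djb2_reversed(n) for n in names]

# The last byte is added without any further multiplication,
# so names that differ by 1 in the last byte hash 1 apart.
forward_gaps = [(b - a) & MASK32 for a, b in zip(forward, forward[1:])]

# In reverse order the differing byte is consumed first and is then
# multiplied by 33 six more times, so the gap becomes 33**6 (mod 2**32).
backward_gaps = [(b - a) & MASK32 for a, b in zip(backward, backward[1:])]

print(forward_gaps)
print(backward_gaps)
```

In this sketch the forward gaps are all 1, while the reversed gaps are all 33**6, which spreads the values out across the 32-bit range instead of packing them into adjacent buckets.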

bakpakin · 2024-11-17T17:30:50Z

A simple fix for this is to just add some extra hash mixing at the end of the string hash. We already do this for tuples and other structures to improve the hashing quality - for strings we are still just using a very basic DJB hash.

bakpakin · 2024-11-17T17:33:27Z

Pushed 5d1bd8a, which uses our existing janet_hash_mix routine to mix the string length into the string hash as well as to make sure gensyms are far apart.
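For readers who want to see what this kind of final mixing step does, here is a rough Python sketch. It is not copied from the commit: hash_mix below is an assumed hash_combine-style mixer built around the constant 0x9E3779B9, and the names djb2, hash_mix, and mixed_string_hash are illustrative only.

```python
MASK32 = 0xFFFFFFFF
GOLDEN = 0x9E3779B9  # assumed mixing constant, not taken from the commit

def djb2(data: bytes) -> int:
    """Basic DJB-style string hash, truncated to 32 bits."""
    h = 5381
    for b in data:
        h = (h * 33 + b) & MASK32
    return h

def hash_mix(h: int, more: int) -> int:
    """Assumed hash_combine-style mixer: folds `more` into `h` with shifts and adds."""
    mix1 = (more + GOLDEN + ((h << 6) & MASK32) + (h >> 2)) & MASK32
    mix2 = (GOLDEN + ((mix1 << 6) & MASK32) + (mix1 >> 2)) & MASK32
    return h ^ mix2

def mixed_string_hash(data: bytes) -> int:
    """DJB hash followed by one mixing round that folds in the length."""
    return hash_mix(djb2(data), len(data))

# Hypothetical gensym-like names that differ only in their final byte.
names = [f"_00000{i}".encode() for i in range(1, 5)]
mixed = [mixed_string_hash(n) for n in names]
mixed_gaps = [(b - a) & MASK32 for a, b in zip(mixed, mixed[1:])]

print([hex(h) for h in mixed])
print(mixed_gaps)
```

The point of the sketch is only that a small change in the raw hash is spread into higher bits by the shifts, so inputs whose raw hashes are 1 apart no longer produce a run of consecutive values.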

sogaiu · 2024-11-18T01:17:17Z

It's my own fault but some of my tests seem to have been relying on internal details and they now fail with 5d1bd8a.

As a specific example, and as might be expected, the order of the elements returned by things like pairs can differ from before.

With the change I now get:

$ janet
Janet 1.37.0-dev-5d1bd8a9 linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:b 2) (:a 1)]

Before the change this was:

$ janet
Janet 1.37.0-dev-bafa6bff linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:a 1) (:b 2)]

Time to rewrite some tests perhaps (^^;

Update: tests have been updated. Found and fixed some tests that were problematic for other reasons so perhaps there was a net gain :)

ianthehenry · 2024-11-18T04:21:49Z

Running Bauble's test suite before and after the new hash function:

Benchmark 1: ~/bin/janet-bafa6bff jpm_tree/bin/judge
  Time (mean ± σ):     654.0 ms ±   3.8 ms    [User: 632.7 ms, System: 19.8 ms]
  Range (min … max):   648.5 ms … 657.7 ms    10 runs

Benchmark 2: ~/bin/janet-5d1bd8a9 jpm_tree/bin/judge
  Time (mean ± σ):     607.1 ms ±   2.2 ms    [User: 586.7 ms, System: 18.9 ms]
  Range (min … max):   604.1 ms … 610.9 ms    10 runs

Not bad!

sogaiu · 2024-11-18T13:27:12Z

Hurray for ⌊(2^32)⁄𝜙⌋.
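For anyone curious about that expression, a quick Python check shows it evaluates to the widely used 32-bit golden-ratio constant. That this is the constant used inside the mixing routine is an assumption here, not something shown in this thread.

```python
import math

PHI = (1 + math.sqrt(5)) / 2          # golden ratio
GOLDEN = math.floor(2 ** 32 / PHI)    # floor(2^32 / phi)

print(GOLDEN, hex(GOLDEN))  # 2654435769 0x9e3779b9
```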

ianthehenry closed this as completed Nov 18, 2024

This was referenced Nov 25, 2024

Advertise client capability of janet-netrepl janet-lang/spork#203

Merged

Define symbol early enough for use in math janet-lang/spork#204

Merged
