Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

default string hash produces a lot of consecutive ints for (gensym) symbols #1520

Closed
ianthehenry opened this issue Nov 17, 2024 · 5 comments
Closed

Comments

@ianthehenry
Copy link
Contributor

ianthehenry commented Nov 17, 2024

While debugging #1519 I noticed that the symbol cache is very densely packed with gensyms, which made the hash collision that breaks symbol/slice far more likely than it "should" be. (The application that actually triggered that bug had to traverse 4072 full buckets during the lookup in question, even though I think the resizing tries to keep the cache only half full.)

repl:1:> (hash (gensym))
135496779
repl:2:> (hash (gensym))
135496780
repl:3:> (hash (gensym))
135496781
repl:4:> (hash (gensym))
135496782
repl:5:> (hash (gensym))
135496773
repl:6:> (hash (gensym))
135496774

A cheap workaround for this particular issue would be to hash strings in reverse order, which seems worth it given how much of the symbol cache consists of gensyms during compilation, but I didn't want to do that in case someone proposes an altogether better hash function.

@bakpakin
Copy link
Member

A simple fix for this is to just add some extra hash mixing at the end of the string hash. We already do this for tuples and other structures to improve the hashing quality - for strings we are still just using a very basic DJB hash.

@bakpakin
Copy link
Member

Pushed 5d1bd8a that uses our existing janet_hash_mix routine to add the string length into the string hash as well a make sure gensyms are far apart.

@sogaiu
Copy link
Contributor

sogaiu commented Nov 18, 2024

It's my own fault but some of my tests seem to have been relying on internal details and they now fail with 5d1bd8a.

As a specific example, as might be expected, the return value of things like pairs can be different from before.

With the change I now get:

$ janet
Janet 1.37.0-dev-5d1bd8a9 linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:b 2) (:a 1)]

before the change this was:

$ janet
Janet 1.37.0-dev-bafa6bff linux/x64/gcc - '(doc)' for help
repl:1:> (pairs {:a 1 :b 2})
@[(:a 1) (:b 2)]

Time to rewrite some tests perhaps (^^;


Update: tests have been updated. Found and fixed some tests that were problematic for other reasons so perhaps there was a net gain :)

@ianthehenry
Copy link
Contributor Author

Running Bauble's test suite before and after the new hash function:

Benchmark 1: ~/bin/janet-bafa6bff jpm_tree/bin/judge
  Time (mean ± σ):     654.0 ms ±   3.8 ms    [User: 632.7 ms, System: 19.8 ms]
  Range (min … max):   648.5 ms … 657.7 ms    10 runs

Benchmark 2: ~/bin/janet-5d1bd8a9 jpm_tree/bin/judge
  Time (mean ± σ):     607.1 ms ±   2.2 ms    [User: 586.7 ms, System: 18.9 ms]
  Range (min … max):   604.1 ms … 610.9 ms    10 runs

Not bad!

@sogaiu
Copy link
Contributor

sogaiu commented Nov 18, 2024

Hurray for ⌊(2^32)⁄𝜙⌋.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants