-
-
Notifications
You must be signed in to change notification settings - Fork 420
fix Issue 14385 - AA should use open addressing hash #1229
Conversation
- new AA implementation - uses open addressing with quadratic probing (triangular numbers) and pow2 table - uses NO_SCAN for entries when applicable - minimizes alignment gap for values - calls postblit on aa.keys and aa.values
d8e65bc
to
79bc91b
Compare
Looks good to a first cursory inspection. Can we add some hint to a debugger to figure out what version of the AA implementation is used? Both cv2pdb and mago rebuild the internals to display AA elements. |
What kind of hint would work? |
} | ||
// set hash and blit value | ||
auto pdst = p.entry + off; | ||
pdst[0 .. valsz] = pval[0 .. valsz]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment why no postblit is necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, IIRC the compiler already inserts the necessary postblits for lvalues when constructing an AA literal. Should add a unit test for that anyhow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, that actually revealed a bug in the existing AA.
When values get overwritten during AA literal construction they need to be destroyed.
I think a global version variable should work, as globals are usually not split into seperate COMDATs, so it would not be stripped by the linker:
I just tried this, it is built into the same section as the module info. |
- so that debuggers know how to pretty-print the content of an AA
Done as |
- also destroy values before overwriting them (due to duplicate keys) during literal construction
a554721
to
88ee8db
Compare
Should we move the keys into their own array in lock-step with the bucket array? That would avoid the tiEntry machinery which gets even worse when trying to add RTInfo for a precise GC. Pro:
Con:
|
I think this AA implementation is a nice improvement and fixes a number of issues regarding postblit and destructor calls. It is not a small change, though, so I'd like some AA experts to chime in. |
Will try if that helps, though I'd like to avoid further optimization discussions as we can always make things faster tomorrow.
The implementation is state of the art for hash tables, see https://google-sparsehash.googlecode.com/svn/trunk/doc/implementation.html or http://llvm.org/docs/doxygen/html/DenseMap_8h_source.html. The main intention here is to improve the old implementation so that we have a solid AA while transitioning to a library implementation. Only that allows to get a fast AA. |
Then I'm slightly disappointed by the limited improvement that the benchmarks show. It seems there are no objections otherwise, though, so let's get this in. The omission of the |
Auto-merge toggled on |
fix Issue 14385 - AA should use open addressing hash
Yeah, I was a little disappointed as well, the effect is a bit bigger than it appears b/c only a fraction of the benchmark time is actually spent in the AA. Just added another benchmark #1230. Lots of more improvements are possible with a library AA. |
Hmmm, while trying this with the precise GC I noticed that some benchmarks have become considerably slower on Win32 with the new AA implementation, namely bulk (0.290s -> 0.328s), resize (0.388s -> 0.418) and especially testgc3 (1.804s -> 2.203s). Running the benchmarks for Win64 yields results similar to yours. |
Will investigate this. |
Particularly testgc3 and bulk cause different GC pool allocations, causing a big part of the difference. Then I'm also seeing a slight increase in stalled frontend cycles, the mix function seems to be responsible for that. |
I expected the new version to use less memory for 64-bit, but it seems not to change: both versions of testgc3 need 247 MB as reported by the GC, though that's probably much more than the actual live memory. The 32-bit versions both show 117 MB memory usage, and no big change in garbage collection time.
I tried to replace the mix function with a noop, but it did not have a big effect. Not sure how much the testgc3 requires a good hash, though. I also noticed your replacement of |
Might make sense for <4 bytes, but only if inlineable, because memset does the same check. |
OK, the difference seems to come from the size increase of the buckets array. For the testgc3 AAs (with 200 elements) the bucket array is now pushed into the large alloc size class (4096 bytes) which explains the slowdown. |
I recently tried a few modifications, too. The new implementation grows 3 times for the 200 entries, while the old one did this only once. Starting with an initial bucket size of 64 causes also a single rehashing, but it did not make a large difference.
I thought about that, too, but did not try it. Event with size_t hashes, it could avoid an additional cache line read if the hash does not match. |
if (auto p = aa.findSlotLookup(hash, pkey, ti.key)) | ||
return p.entry + aa.keysz + aa.keypad; | ||
|
||
auto p = aa.findSlotInsert(hash); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does a lookup very similar to findSlotLookup
just before. These can be combined, but when trying that, there was only a small improvement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I separated them because the combined function gets too complex.
I tried this and it's quite a bit faster for testgc3 on 32-bit, though still considerably slower than the old hash. It's definitely slower for anything else because lookup now requires 2 distinct memory accesses. |
I just tried to read this variable from within mago, but unfortunately it is not linked into the binary. It seems the module info isn't there at all, even if I add a class to the module. How's the class factory supposed to work without it? If I add an empty "shared static this()" it works. Should we add it here? Maybe instead of a version for AA, a dmd/druntime/phobos version somewhere in the binary might also be useful to adapt tooling based on the release version. |
Issue 14385 – AA should use open addressing hash