This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

fix Issue 14385 - AA should use open addressing hash #1229

Merged
merged 3 commits into dlang:master from open_addressing
Apr 24, 2015

Conversation

MartinNowak
Member

  • new AA implementation
  • uses open addressing with quadratic probing (triangular numbers) and pow2 table
  • uses NO_SCAN for entries when applicable
  • minimizes alignment gap for values
  • calls postblit on aa.keys and aa.values
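The probing scheme named in the list above (open addressing with triangular-number quadratic probing over a power-of-two table) can be sketched as follows. This is an illustrative C model, not the actual druntime code; the useful property of this particular combination is that the probe sequence visits every slot of a power-of-two table exactly once before repeating.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch (not the druntime implementation): the i-th probe for a given
 * hash advances by the i-th triangular number, i.e.
 *   slot_i = (hash + i*(i+1)/2) & (dim - 1)
 * where dim is a power of two. */
size_t probe_slot(size_t hash, size_t i, size_t dim)
{
    return (hash + i * (i + 1) / 2) & (dim - 1);
}

/* Check the full-cycle property for one table size (dim <= 64 here). */
bool probes_cover_table(size_t hash, size_t dim)
{
    if (dim > 64)
        return false;               /* keep the sketch's bitmap small */
    bool seen[64] = { false };
    for (size_t i = 0; i < dim; ++i)
        seen[probe_slot(hash, i, dim)] = true;
    for (size_t s = 0; s < dim; ++s)
        if (!seen[s])
            return false;           /* a slot was never probed */
    return true;
}
```

This full-coverage property is what makes triangular-number quadratic probing safe with power-of-two table sizes, whereas plain quadratic probing (`+i*i`) can miss slots.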

Issue 14385 – AA should use open addressing hash

@MartinNowak
Member Author

[benchmark chart]

The conmsg benchmark suffers from increased congestion on the GC lock, because I removed the small inplace AA buckets which take quite a lot of extra memory even though they aren't used for bigger AAs. The congestion problem will get fixed by introducing GC thread-caches.

- new AA implementation
- uses open addressing with quadratic probing (triangular numbers) and pow2 table
- uses NO_SCAN for entries when applicable
- minimizes alignment gap for values
- calls postblit on aa.keys and aa.values
@rainers
Member

rainers commented Apr 21, 2015

Looks good on a first, cursory inspection.

Can we add some hint to a debugger to figure out what version of the AA implementation is used? Both cv2pdb and mago rebuild the internals to display AA elements.

@MartinNowak
Member Author

Can we add some hint to a debugger to figure out what version of the AA implementation is used?

What kind of hint would work?

}
// set hash and blit value
auto pdst = p.entry + off;
pdst[0 .. valsz] = pval[0 .. valsz];
Member


Could you add a comment why no postblit is necessary here?

Member Author


Sure, IIRC the compiler already inserts the necessary postblits for lvalues when constructing an AA literal. Should add a unit test for that anyhow.

Member Author


Done, and that actually revealed a bug in the existing AA: when values get overwritten during AA literal construction, they need to be destroyed.

@rainers
Member

rainers commented Apr 21, 2015

What kind of hint would work?

I think a global version variable should work; as globals are usually not split into separate COMDATs, it would not be stripped by the linker:

immutable int AA_version = 1;

I just tried this, it is built into the same section as the module info.
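The idea above can be sketched in C (names here are illustrative; the actual druntime symbol ended up being `_aaVersion`, as noted below): a plain global with external linkage lands in a regular data section next to other module data, so a debugger can look the symbol up by name and branch on its value.

```c
/* Sketch of a debugger-visible version marker (assumed names; the real
 * druntime uses `extern(C) immutable int _aaVersion = 1;`). Because it
 * is an ordinary global with external linkage, the linker keeps it and
 * tools like cv2pdb or mago can read it to pick the right AA layout. */
const int aa_version = 1;

int aa_version_for_debugger(void)
{
    return aa_version;
}
```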

- so that debuggers know how to pretty-print the content of an AA
@MartinNowak
Member Author

immutable int AA_version = 1;

Done as extern(C) immutable int _aaVersion = 1;.

- also destroy values before overwriting them (due to duplicate keys)
  during literal construction
@rainers
Member

rainers commented Apr 22, 2015

Should we move the keys into their own array in lock-step with the bucket array? That would avoid the tiEntry machinery which gets even worse when trying to add RTInfo for a precise GC.

Pro:

  • It might need less memory without padding for value alignment.
  • A lot of indirection could be avoided during lookups.

Con:

  • It could eat more memory for larger key types
  • Destructors will run on unused keys that are in their init state.
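The two layouts under discussion can be sketched roughly as C structs. The field names and element types here are illustrative assumptions, not the actual druntime definitions; the point is only the structural difference.

```c
#include <stddef.h>

/* Current layout (sketch): each bucket holds a hash plus a pointer to a
 * separately allocated entry containing key and value, so a key compare
 * costs one extra indirection. */
typedef struct Entry  { long key; long value; } Entry;
typedef struct Bucket { size_t hash; Entry *entry; } Bucket;

/* Proposed alternative (sketch): keys live in their own array, kept in
 * lock-step with the bucket array, so lookups can compare keys without
 * chasing the entry pointer - at the cost of reserving key storage for
 * every slot, used or not. */
typedef struct SplitTable {
    size_t *hashes;   /* dim entries */
    long   *keys;     /* dim entries, present even for unused slots */
    long   *values;   /* dim entries */
} SplitTable;
```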

@rainers
Member

rainers commented Apr 22, 2015

I think this AA implementation is a nice improvement and fixes a number of issues regarding postblit and destructor calls. It is not a small change, though, so I'd like some AA experts to chime in.

@MartinNowak
Member Author

Should we move the keys into their own array in lock-step with the bucket array?

Will try if that helps, though I'd like to avoid further optimization discussions as we can always make things faster tomorrow.

I think this AA implementation is a nice improvement and fixes a number of issues regarding postblit and destructor calls. It is not a small change, though, so I'd like some AA experts to chime in.

The implementation is state of the art for hash tables, see https://google-sparsehash.googlecode.com/svn/trunk/doc/implementation.html or http://llvm.org/docs/doxygen/html/DenseMap_8h_source.html.

The main intention here is to improve the old implementation so that we have a solid AA while transitioning to a library implementation. Only a library implementation will allow a really fast AA.

@rainers
Member

rainers commented Apr 24, 2015

The implementation is state of the art for hash tables

Then I'm slightly disappointed by the limited improvement that the benchmarks show.

It seems there are no objections otherwise, though, so let's get this in. The omission of the next field in Entry will probably help the precise GC in the benchmarks, too. ;-)

@rainers
Member

rainers commented Apr 24, 2015

Auto-merge toggled on

rainers added a commit that referenced this pull request Apr 24, 2015
fix Issue 14385 - AA should use open addressing hash
@rainers rainers merged commit 6698ee2 into dlang:master Apr 24, 2015
@MartinNowak MartinNowak deleted the open_addressing branch April 24, 2015 14:33
@MartinNowak
Member Author

Then I'm slightly disappointed by the limited improvement that the benchmarks show.

Yeah, I was a little disappointed as well. The effect is a bit bigger than it appears, because only a fraction of the benchmark time is actually spent in the AA. I just added another benchmark, #1230.

Many more improvements are possible with a library AA.
My idea is to add one as core.aa that's compatible with the built-in one. Let's see whether that works out.

@rainers
Member

rainers commented Apr 24, 2015

Hmmm, while trying this with the precise GC I noticed that some benchmarks have become considerably slower on Win32 with the new AA implementation, namely bulk (0.290s -> 0.328s), resize (0.388s -> 0.418s) and especially testgc3 (1.804s -> 2.203s).

Running the benchmarks for Win64 yields results similar to yours.

@MartinNowak
Member Author

Hmmm, while trying this with the precise GC I noticed that some benchmarks have become considerably slower on Win32 with the new AA implementation, namely bulk (0.290s -> 0.328s), resize (0.388s -> 0.418) and especially testgc3 (1.804s -> 2.203s).

Will investigate this.

@MartinNowak
Member Author

testgc3 and bulk in particular trigger different GC pool allocations, which accounts for a big part of the difference.

I'm also seeing a slight increase in stalled frontend cycles; the mix function seems to be responsible for that.
https://github.com/D-Programming-Language/druntime/pull/1229/files#diff-fdc0da51523ff831dd6cbe33a5bb8b4cR294
I'd very much like to use some bitshift mix instead of a multiplication, but I couldn't find one that's good enough. We could move this into TypeInfo_int/ptr though, because strings and structs are well hashed already, and doing some more work in those virtual functions also helps to fill the pipeline.
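The multiplicative mix under discussion presumably looks something like the following Fibonacci-hashing style finalizer; the constant and structure here are a common textbook choice, not necessarily the exact druntime code. The multiplication is what causes the pipeline pressure mentioned above, while the shift-and-xor merely folds the high bits back down.

```c
#include <stdint.h>

/* Sketch of a multiplicative mix for integer/pointer hashes (assumed,
 * not the exact druntime `mix`). Multiplying by a large odd constant
 * spreads consecutive keys across the full hash range, and xoring in
 * the high half moves that entropy into the low bits - which matter,
 * because the table index is the hash masked by (pow2 dim - 1). */
static uint32_t mix(uint32_t h)
{
    h *= 2654435769u;      /* floor(2^32 / golden ratio), a classic choice */
    h ^= h >> 16;          /* fold high-bit entropy into the low bits */
    return h;
}
```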

@rainers
Member

rainers commented Apr 26, 2015

Particularly testgc3 and bulk cause different GC pool allocations, causing a big part of the difference.

I expected the new version to use less memory on 64-bit, but it seems not to have changed: both versions of testgc3 need 247 MB as reported by the GC, though that's probably much more than the actual live memory.
The old implementation had some bad alignment requirements that caused each entry to allocate 64 bytes. This should be down to 16 now, but every entry also has at least one 16-byte slot in the bucket array.

The 32-bit versions both show 117 MB memory usage, and no big change in garbage collection time.

Then I'm also seeing a slight increase in stalled frontend cycles, the mix function seems to be responsible for that.

I tried to replace the mix function with a no-op, but it did not have a big effect. Not sure how much testgc3 requires a good hash, though.

I also noticed your replacement of rep stosb (it seems very slow on my mobile system, too; according to Agner Fog, these operations have quite a large setup time and only pay off for large block operations). Replacing the memset and the memcpy with an inlinable version tweaked for short sizes improves the benchmark by 5-10%.
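Such an inlinable short-size replacement might look roughly like this sketch (the 16-byte cutoff is an assumption for illustration, not a measured threshold, and this is not the actual druntime code):

```c
#include <string.h>
#include <stddef.h>

/* Sketch of a memset variant tweaked for the short blocks an AA deals
 * with: handle small sizes with a simple byte loop the compiler can
 * inline and unroll, and fall back to the library memset (which may
 * use `rep stos` with its large setup cost) only for bigger blocks. */
static void *small_memset(void *dst, int c, size_t n)
{
    if (n <= 16) {
        unsigned char *p = dst;
        for (size_t i = 0; i < n; ++i)
            p[i] = (unsigned char)c;
        return dst;
    }
    return memset(dst, c, n);      /* large blocks: library wins */
}
```

The gain only materializes if the call is actually inlined; otherwise the size check is just duplicated work, since memset performs the same dispatch internally.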

@MartinNowak
Member Author

Replacing the memset and the memcpy with a inlinable version tweaked for short sizes improves the benchmark by 5-10%.

Might make sense for <4 bytes, but only if inlineable, because memset does the same check.
Issue 14458 – very slow ubyte[] assignment (dmd doesn't use memset)

@MartinNowak
Member Author

OK, the difference seems to come from the size increase of the bucket array. For the testgc3 AAs (with 200 elements) the bucket array is now pushed into the large alloc size class (4096 bytes), which explains the slowdown.
I will try to split the bucket array into an array for hashes and one for the pointers. That would allow saving 25% of memory on x64 by using only a 32-bit hash, but might incur a second cache miss.
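As a rough sanity check of that arithmetic (assumed numbers: 16 bytes per bucket, as discussed earlier in this thread; the 4096-byte large-alloc boundary is taken from the comment above):

```c
#include <stddef.h>

/* Sketch: size of the bucket array for a power-of-two table dimension.
 * With 16-byte buckets, a 256-slot table already occupies 4 KiB on its
 * own, which pushes the allocation into the GC's large size class. */
static size_t bucket_array_bytes(size_t dim, size_t bucket_size)
{
    return dim * bucket_size;
}
```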

@rainers
Member

rainers commented May 2, 2015

I recently tried a few modifications, too. The new implementation grows 3 times for the 200 entries, while the old one did this only once. Starting with an initial bucket size of 64 also results in only a single rehashing, but it did not make a large difference.

Will try to split the bucket array into an array for hashes and one for the pointers. Which allows to save 25% mem on x64 by using only a 32-bit hash but might incur a 2nd cache miss.

I thought about that, too, but did not try it. Even with size_t hashes, it could avoid an additional cache line read if the hash does not match.

if (auto p = aa.findSlotLookup(hash, pkey, ti.key))
return p.entry + aa.keysz + aa.keypad;

auto p = aa.findSlotInsert(hash);
Member


This does a lookup very similar to the findSlotLookup just before. The two could be combined, but when I tried that there was only a small improvement.

Member Author


I separated them because the combined function gets too complex.
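For illustration, a combined find-or-insert over the triangular probe sequence might look like the following sketch (hypothetical names; the real druntime keeps findSlotLookup and findSlotInsert separate, as explained above, and this toy version omits growth, tombstones, and values):

```c
#include <stddef.h>

#define DIM   16                     /* power-of-two table size */
#define EMPTY 0                      /* 0 marks an unused slot; keys must
                                        therefore be nonzero in this sketch */

static size_t table[DIM];

/* One probe loop serving both lookup and insert: a hit returns the
 * existing slot, and reaching an empty slot ends the probe chain, so
 * a miss can insert right there without re-probing. */
static size_t find_or_insert(size_t key)
{
    for (size_t i = 0; i < DIM; ++i) {
        size_t s = (key + i * (i + 1) / 2) & (DIM - 1);
        if (table[s] == key || table[s] == EMPTY) {
            table[s] = key;          /* no-op on a hit */
            return s;
        }
    }
    return DIM;                      /* table full (not handled here) */
}
```

Even this stripped-down version shows why the combined function grows hairy once insertion also has to handle growth and postblits inside the same loop.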

@MartinNowak
Member Author

I thought about that, too, but did not try it. Event with size_t hashes, it could avoid an additional cache line read if the hash does not match.

I tried this, and it's quite a bit faster for testgc3 on 32-bit, though still considerably slower than the old hash. It's definitely slower for anything else, because a lookup now requires 2 distinct memory accesses.
The main reason it's faster for testgc3 is that the bucket array no longer falls into the large alloc class.
The main reason testgc3 is slower overall, though, is the additional rehashing. The benchmark actually executes fewer instructions, but I see a lot of stalled cycles due to cache misses when rehashing. This effect is amplified because testgc3 creates thousands of AAs and fills all of them with 200 elements.
I already made a compromise and set the growth factor to 4 (instead of 2), which trades memory waste for CPU time. I don't think growing by a factor of 10 as in the old implementation is reasonable.

@rainers
Member

rainers commented Jul 28, 2015

Done as extern(C) immutable int _aaVersion = 1;

I just tried to read this variable from within mago, but unfortunately it is not linked into the binary. It seems the module info isn't there at all, even if I add a class to the module. How is the class factory supposed to work without it?

If I add an empty "shared static this()" it works. Should we add one here? Maybe instead of an AA-specific version, a dmd/druntime/phobos version somewhere in the binary might also be useful, to adapt tooling based on the release version.
