This repository has been archived by the owner on Oct 12, 2022. It is now read-only.

fix Issue 14385 - AA should use open addressing hash #1229

Merged
merged 3 commits into dlang:master from open_addressing
Apr 24, 2015

Conversation

MartinNowak
Member

  • new AA implementation
  • uses open addressing with quadratic probing (triangular numbers) and pow2 table
  • uses NO_SCAN for entries when applicable
  • minimizes alignment gap for values
  • calls postblit on aa.keys and aa.values
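The probing scheme named in the list above (open addressing with triangular-number quadratic probing over a power-of-two table) can be sketched as follows. This is an illustrative C model, not the actual druntime code; the useful property of this particular combination is that the probe sequence visits every slot of a power-of-two table exactly once before repeating.

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch (not the druntime implementation): the i-th probe for a given
 * hash advances by the i-th triangular number, i.e.
 *   slot_i = (hash + i*(i+1)/2) & (dim - 1)
 * where dim is a power of two. */
size_t probe_slot(size_t hash, size_t i, size_t dim)
{
    return (hash + i * (i + 1) / 2) & (dim - 1);
}

/* Check the full-cycle property for one table size (dim <= 64 here). */
bool probes_cover_table(size_t hash, size_t dim)
{
    if (dim > 64)
        return false;               /* keep the sketch's bitmap small */
    bool seen[64] = { false };
    for (size_t i = 0; i < dim; ++i)
        seen[probe_slot(hash, i, dim)] = true;
    for (size_t s = 0; s < dim; ++s)
        if (!seen[s])
            return false;           /* a slot was never probed */
    return true;
}
```

This full-coverage property is what makes triangular-number quadratic probing safe with power-of-two table sizes, whereas plain quadratic probing (`+i*i`) can miss slots.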

Issue 14385 – AA should use open addressing hash

@MartinNowak
Member Author

[benchmark chart]

The conmsg benchmark suffers from increased congestion on the GC lock, because I removed the small inplace AA buckets which take quite a lot of extra memory even though they aren't used for bigger AAs. The congestion problem will get fixed by introducing GC thread-caches.

- new AA implementation
- uses open addressing with quadratic probing (triangular numbers) and pow2 table
- uses NO_SCAN for entries when applicable
- minimizes alignment gap for values
- calls postblit on aa.keys and aa.values
@rainers
Member

rainers commented Apr 21, 2015

Looks good on a first, cursory inspection.

Can we add some hint to a debugger to figure out what version of the AA implementation is used? Both cv2pdb and mago rebuild the internals to display AA elements.

@MartinNowak
Member Author

Can we add some hint to a debugger to figure out what version of the AA implementation is used?

What kind of hint would work?

}
// set hash and blit value
auto pdst = p.entry + off;
pdst[0 .. valsz] = pval[0 .. valsz];
Member


Could you add a comment why no postblit is necessary here?

Member Author


Sure, IIRC the compiler already inserts the necessary postblits for lvalues when constructing an AA literal. Should add a unit test for that anyhow.

Member Author


Done, and that actually revealed a bug in the existing AA: when values get overwritten during AA literal construction, they need to be destroyed.

@rainers
Member

rainers commented Apr 21, 2015

What kind of hint would work?

I think a global version variable should work; as globals are usually not split into separate COMDATs, it would not be stripped by the linker:

immutable int AA_version = 1;

I just tried this, it is built into the same section as the module info.
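The idea above can be sketched in C (names here are illustrative; the actual druntime symbol ended up being `_aaVersion`, as noted below): a plain global with external linkage lands in a regular data section next to other module data, so a debugger can look the symbol up by name and branch on its value.

```c
/* Sketch of a debugger-visible version marker (assumed names; the real
 * druntime uses `extern(C) immutable int _aaVersion = 1;`). Because it
 * is an ordinary global with external linkage, the linker keeps it and
 * tools like cv2pdb or mago can read it to pick the right AA layout. */
const int aa_version = 1;

int aa_version_for_debugger(void)
{
    return aa_version;
}
```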

- so that debuggers know how to pretty-print the content of an AA
@MartinNowak
Member Author

immutable int AA_version = 1;

Done as extern(C) immutable int _aaVersion = 1;.

- also destroy values before overwriting them (due to duplicate keys)
  during literal construction
@rainers
Member

rainers commented Apr 22, 2015

Should we move the keys into their own array in lock-step with the bucket array? That would avoid the tiEntry machinery which gets even worse when trying to add RTInfo for a precise GC.

Pro:

  • It might need less memory without padding for value alignment.
  • A lot of indirection could be avoided during lookups.

Con:

  • It could eat more memory for larger key types
  • Destructors will run on unused keys that are in their init state.
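The two layouts under discussion can be sketched roughly as C structs. The field names and element types here are illustrative assumptions, not the actual druntime definitions; the point is only the structural difference.

```c
#include <stddef.h>

/* Current layout (sketch): each bucket holds a hash plus a pointer to a
 * separately allocated entry containing key and value, so a key compare
 * costs one extra indirection. */
typedef struct Entry  { long key; long value; } Entry;
typedef struct Bucket { size_t hash; Entry *entry; } Bucket;

/* Proposed alternative (sketch): keys live in their own array, kept in
 * lock-step with the bucket array, so lookups can compare keys without
 * chasing the entry pointer - at the cost of reserving key storage for
 * every slot, used or not. */
typedef struct SplitTable {
    size_t *hashes;   /* dim entries */
    long   *keys;     /* dim entries, present even for unused slots */
    long   *values;   /* dim entries */
} SplitTable;
```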

@rainers
Member

rainers commented Apr 22, 2015

I think this AA implementation is a nice improvement and fixes a number of issues regarding postblit and destructor calls. It is not a small change, though, so I'd like some AA experts to chime in.

@MartinNowak
Member Author

Should we move the keys into their own array in lock-step with the bucket array?

Will try if that helps, though I'd like to avoid further optimization discussions as we can always make things faster tomorrow.

I think this AA implementation is a nice improvement and fixes a number of issues regarding postblit and destructor calls. It is not a small change, though, so I'd like some AA experts to chime in.

The implementation is state of the art for hash tables, see https://google-sparsehash.googlecode.com/svn/trunk/doc/implementation.html or http://llvm.org/docs/doxygen/html/DenseMap_8h_source.html.

The main intention here is to improve the old implementation so that we have a solid AA while transitioning to a library implementation. Only a library implementation will allow a really fast AA.

@rainers
Member

rainers commented Apr 24, 2015

The implementation is state of the art for hash tables

Then I'm slightly disappointed by the limited improvement that the benchmarks show.

It seems there are no objections otherwise, though, so let's get this in. The omission of the next field in Entry will probably help the precise GC in the benchmarks, too. ;-)

@rainers
Member

rainers commented Apr 24, 2015

Auto-merge toggled on

rainers added a commit that referenced this pull request Apr 24, 2015
fix Issue 14385 - AA should use open addressing hash
@rainers rainers merged commit 6698ee2 into dlang:master Apr 24, 2015
@MartinNowak MartinNowak deleted the open_addressing branch April 24, 2015 14:33
@MartinNowak
Member Author

Then I'm slightly disappointed by the limited improvement that the benchmarks show.

Yeah, I was a little disappointed as well. The effect is a bit bigger than it appears, because only a fraction of the benchmark time is actually spent in the AA. I just added another benchmark, #1230.

Many more improvements are possible with a library AA.
My idea is to add one as core.aa that's compatible with the built-in one. Let's see whether that works out.

@rainers
Member

rainers commented Apr 24, 2015

Hmmm, while trying this with the precise GC I noticed that some benchmarks have become considerably slower on Win32 with the new AA implementation, namely bulk (0.290s -> 0.328s), resize (0.388s -> 0.418s) and especially testgc3 (1.804s -> 2.203s).

Running the benchmarks for Win64 yields results similar to yours.

@MartinNowak
Member Author

Hmmm, while trying this with the precise GC I noticed that some benchmarks have become considerably slower on Win32 with the new AA implementation, namely bulk (0.290s -> 0.328s), resize (0.388s -> 0.418) and especially testgc3 (1.804s -> 2.203s).

Will investigate this.

@MartinNowak
Member Author

testgc3 and bulk in particular trigger different GC pool allocations, which accounts for a big part of the difference.

I'm also seeing a slight increase in stalled frontend cycles; the mix function seems to be responsible for that.
https://github.com/D-Programming-Language/druntime/pull/1229/files#diff-fdc0da51523ff831dd6cbe33a5bb8b4cR294
I'd very much like to use some bitshift mix instead of a multiplication, but I couldn't find one that's good enough. We could move this into TypeInfo_int/ptr though, because strings and structs are well hashed already, and doing some more work in those virtual functions also helps to fill the pipeline.
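The multiplicative mix under discussion presumably looks something like the following Fibonacci-hashing style finalizer; the constant and structure here are a common textbook choice, not necessarily the exact druntime code. The multiplication is what causes the pipeline pressure mentioned above, while the shift-and-xor merely folds the high bits back down.

```c
#include <stdint.h>

/* Sketch of a multiplicative mix for integer/pointer hashes (assumed,
 * not the exact druntime `mix`). Multiplying by a large odd constant
 * spreads consecutive keys across the full hash range, and xoring in
 * the high half moves that entropy into the low bits - which matter,
 * because the table index is the hash masked by (pow2 dim - 1). */
static uint32_t mix(uint32_t h)
{
    h *= 2654435769u;      /* floor(2^32 / golden ratio), a classic choice */
    h ^= h >> 16;          /* fold high-bit entropy into the low bits */
    return h;
}
```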

@rainers
Member

rainers commented Apr 26, 2015

Particularly testgc3 and bulk cause different GC pool allocations, causing a big part of the difference.

I expected the new version to use less memory on 64-bit, but it seems not to have changed: both versions of testgc3 need 247 MB as reported by the GC, though that's probably much more than the actual live memory.
The old implementation had some bad alignment requirements that caused each entry to allocate 64 bytes. This should be down to 16 now, but every entry also has at least one 16-byte slot in the bucket array.

The 32-bit versions both show 117 MB memory usage, and no big change in garbage collection time.

Then I'm also seeing a slight increase in stalled frontend cycles, the mix function seems to be responsible for that.

I tried to replace the mix function with a no-op, but it did not have a big effect. Not sure how much testgc3 requires a good hash, though.

I also noticed your replacement of rep stosb (it seems very slow on my mobile system, too; according to Agner Fog, these operations have quite a large setup time and only pay off for large block operations). Replacing the memset and the memcpy with an inlinable version tweaked for short sizes improves the benchmark by 5-10%.
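Such an inlinable short-size replacement might look roughly like this sketch (the 16-byte cutoff is an assumption for illustration, not a measured threshold, and this is not the actual druntime code):

```c
#include <string.h>
#include <stddef.h>

/* Sketch of a memset variant tweaked for the short blocks an AA deals
 * with: handle small sizes with a simple byte loop the compiler can
 * inline and unroll, and fall back to the library memset (which may
 * use `rep stos` with its large setup cost) only for bigger blocks. */
static void *small_memset(void *dst, int c, size_t n)
{
    if (n <= 16) {
        unsigned char *p = dst;
        for (size_t i = 0; i < n; ++i)
            p[i] = (unsigned char)c;
        return dst;
    }
    return memset(dst, c, n);      /* large blocks: library wins */
}
```

The gain only materializes if the call is actually inlined; otherwise the size check is just duplicated work, since memset performs the same dispatch internally.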

@MartinNowak
Member Author

Replacing the memset and the memcpy with a inlinable version tweaked for short sizes improves the benchmark by 5-10%.

Might make sense for <4 bytes, but only if inlineable, because memset does the same check.
Issue 14458 – very slow ubyte[] assignment (dmd doesn't use memset)

@MartinNowak
Member Author

OK, the difference seems to come from the size increase of the bucket array. For the testgc3 AAs (with 200 elements) the bucket array is now pushed into the large alloc size class (4096 bytes), which explains the slowdown.
I will try to split the bucket array into an array for hashes and one for the pointers. That would allow saving 25% of memory on x64 by using only a 32-bit hash, but might incur a second cache miss.
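As a rough sanity check of that arithmetic (assumed numbers: 16 bytes per bucket, as discussed earlier in this thread; the 4096-byte large-alloc boundary is taken from the comment above):

```c
#include <stddef.h>

/* Sketch: size of the bucket array for a power-of-two table dimension.
 * With 16-byte buckets, a 256-slot table already occupies 4 KiB on its
 * own, which pushes the allocation into the GC's large size class. */
static size_t bucket_array_bytes(size_t dim, size_t bucket_size)
{
    return dim * bucket_size;
}
```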

@rainers
Member

rainers commented May 2, 2015

I recently tried a few modifications, too. The new implementation grows 3 times for the 200 entries, while the old one did this only once. Starting with an initial bucket size of 64 also results in only a single rehashing, but it did not make a large difference.

Will try to split the bucket array into an array for hashes and one for the pointers. Which allows to save 25% mem on x64 by using only a 32-bit hash but might incur a 2nd cache miss.

I thought about that, too, but did not try it. Even with size_t hashes, it could avoid an additional cache line read if the hash does not match.

if (auto p = aa.findSlotLookup(hash, pkey, ti.key))
return p.entry + aa.keysz + aa.keypad;

auto p = aa.findSlotInsert(hash);
Member


This does a lookup very similar to the findSlotLookup just before. The two could be combined, but when I tried that there was only a small improvement.

Member Author


I separated them because the combined function gets too complex.
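For illustration, a combined find-or-insert over the triangular probe sequence might look like the following sketch (hypothetical names; the real druntime keeps findSlotLookup and findSlotInsert separate, as explained above, and this toy version omits growth, tombstones, and values):

```c
#include <stddef.h>

#define DIM   16                     /* power-of-two table size */
#define EMPTY 0                      /* 0 marks an unused slot; keys must
                                        therefore be nonzero in this sketch */

static size_t table[DIM];

/* One probe loop serving both lookup and insert: a hit returns the
 * existing slot, and reaching an empty slot ends the probe chain, so
 * a miss can insert right there without re-probing. */
static size_t find_or_insert(size_t key)
{
    for (size_t i = 0; i < DIM; ++i) {
        size_t s = (key + i * (i + 1) / 2) & (DIM - 1);
        if (table[s] == key || table[s] == EMPTY) {
            table[s] = key;          /* no-op on a hit */
            return s;
        }
    }
    return DIM;                      /* table full (not handled here) */
}
```

Even this stripped-down version shows why the combined function grows hairy once insertion also has to handle growth and postblits inside the same loop.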

@MartinNowak
Member Author

I thought about that, too, but did not try it. Event with size_t hashes, it could avoid an additional cache line read if the hash does not match.

I tried this, and it's quite a bit faster for testgc3 on 32-bit, though still considerably slower than the old hash. It's definitely slower for anything else, because a lookup now requires 2 distinct memory accesses.
The main reason it's faster for testgc3 is that the bucket array no longer falls into the large alloc class.
The main reason testgc3 is slower overall, though, is the additional rehashing. The benchmark actually executes fewer instructions, but I see a lot of stalled cycles due to cache misses when rehashing. This effect is amplified because testgc3 creates thousands of AAs and fills all of them with 200 elements.
I already made a compromise and set the growth factor to 4 (instead of 2), which trades memory waste for CPU time. I don't think growing by a factor of 10 as in the old implementation is reasonable.

@rainers
Member

rainers commented Jul 28, 2015

Done as extern(C) immutable int _aaVersion = 1;

I just tried to read this variable from within mago, but unfortunately it is not linked into the binary. It seems the module info isn't there at all, even if I add a class to the module. How is the class factory supposed to work without it?

If I add an empty "shared static this()" it works. Should we add one here? Maybe instead of an AA-specific version, a dmd/druntime/phobos version somewhere in the binary might also be useful, to adapt tooling based on the release version.
