WIP/RFC: Refactored dict.jl #7348

mauro3 · 2014-06-21T13:26:20Z

This is a refactor of dict.jl which also takes into account
DataStructures.jl/OrderedDict. The reason I did this is because I
needed a WeakObjectIdDict for PR #5572 (which I did there with an ugly
copy-paste). PR #5572 is for issue #3988 which should lay the
foundation for user-defined help.

Dictionaries have quite extensive internal mechanics and small changes
in those internals make new types of dicts. For instance, whether to
take a straight hash or use an object-id in hashindex makes a Dict
vs ObjectIdDict. This means that when making new dictionaries by
simply wrapping the standard Dict and defining new setindex!,
getindex, etc. one has to replicate lots of those inner mechanics
because dispatch cannot do its magic on inner functions. This PR is a
go at writing dict.jl in a more composable manner.

This refactor can accommodate quite a few different hash-table based
dicts. Implemented are Dict, ObjectIdDict2 (a new variation),
WeakKeyDict (now works for immutables too), WeakObjectIdDict (new),
and OrderedDict (essentially as in DataStructures.jl package).

Performance is as before except for ObjectIdDict2 which is 25% slower
than ObjectIdDict (but has the complete interface). This is because
ObjectIdDict is implemented differently to Dict whereas ObjectIdDict2
is within the same framework.

I'm not sure the style is in line with standard practices. The
concrete subtypes of HashDictionary and their constructors are made
with a macro. This is reminiscent of the approach of @kmsquire in
PR #2548 which was not liked by @JeffBezanson...

If this is not the way to cleanly implement a WeakObjectIdDict (and it
turns out it's needed to solve #3988), then any comments on how to
implement it instead are welcome!

What I did:

- Introduced a new abstract type HashDictionary, which is currently the
  supertype of all implemented dicts and ought to be the supertype of
  all hash-table based dicts.
- Moved some of the Dict-specific methods to Associative.
- Introduced key and value getters and setters (gkey, skey!, ...).
  These allow, for instance, keys to be transformed to WeakRefs and
  back.
- hashindex is now dispatched on the dict type so that, for instance,
  ObjectIdDict can change it to use object_id instead.
- isequal is also dispatched on, with the method isequalkey.
- Added purging of WeakRefs, even for immutables, if their objects get
  gc-ed, with the function topurge.
- The concrete types are then constructed with @makeHashDictionary
  and some of the above internals are specialized.

Then to produce the standard Dict:

@makeHashDictionary(Dict, K, K, V, V, Unordered)

To make an object-id dict only a few extras are needed:

@makeHashDictionary(ObjectIdDict2, K, K, V, V, Unordered)
hashindex(::ObjectIdDict2, key, sz) = (int(object_id(key)) & (sz-1)) + 1
function keyconvert{K,V}(h::OIdDicts{K,V}, key0) # no conversion as that can create a new object.
    !isa(key0, K) ? error(key0, " is not a valid Object-Id-Dict key for type ", K) : key0
end
isequal(::OIdDicts, key1, key2) = key1===key2
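
The hook idea above can be tried out in isolation. What follows is only a minimal sketch in current Julia syntax (not the 0.3 syntax of this PR), and every name in it (SketchHashDict, ValueDict, IdentityDict, newdict, slot) is hypothetical. It uses a fixed-size table with linear probing and no resizing or deletion, so that the only difference between the two dict types is the pair of overridden hooks:

```julia
# Hypothetical sketch: shared table mechanics, two swappable hooks.
abstract type SketchHashDict{K,V} end

struct ValueDict{K,V} <: SketchHashDict{K,V}
    filled::Vector{Bool}
    keys::Vector{K}
    vals::Vector{V}
end

struct IdentityDict{K,V} <: SketchHashDict{K,V}
    filled::Vector{Bool}
    keys::Vector{K}
    vals::Vector{V}
end

# Table size must be a power of two; the sketch never resizes.
newdict(::Type{D}, sz::Int=16) where {K,V,D<:SketchHashDict{K,V}} =
    D(fill(false, sz), Vector{K}(undef, sz), Vector{V}(undef, sz))

# Default hooks: value-based hashing and isequal comparison.
hashindex(::SketchHashDict, key, sz) = Int(hash(key) & UInt(sz - 1)) + 1
isequalkey(::SketchHashDict, a, b) = isequal(a, b)

# Identity-based variant: only the two hooks are overridden.
hashindex(::IdentityDict, key, sz) = Int(objectid(key) & UInt(sz - 1)) + 1
isequalkey(::IdentityDict, a, b) = a === b

# Shared mechanics: find the slot holding `key`, or the first free slot.
function slot(d::SketchHashDict, key)
    sz = length(d.filled)
    i = hashindex(d, key, sz)
    for _ in 1:sz
        (!d.filled[i] || isequalkey(d, d.keys[i], key)) && return i
        i = (i & (sz - 1)) + 1   # linear probing with wrap-around
    end
    error("table full; this sketch does not resize")
end

function Base.setindex!(d::SketchHashDict, v, key)
    i = slot(d, key)
    d.filled[i] = true
    d.keys[i] = key
    d.vals[i] = v
    return d
end

function Base.get(d::SketchHashDict, key, default)
    i = slot(d, key)
    return d.filled[i] ? d.vals[i] : default
end

a, b = [1, 2], [1, 2]   # equal contents, distinct objects
vd = newdict(ValueDict{Vector{Int},String});     vd[a] = "x"
idd = newdict(IdentityDict{Vector{Int},String}); idd[a] = "x"

by_value = get(vd, b, "none")       # found: b is isequal to a
by_identity = get(idd, b, "none")   # not found: b is a different object
same_object = get(idd, a, "none")   # found: a is the stored object
```

setindex! and get are written once against the abstract type; the identity variant changes behaviour purely through dispatch on the two hook functions.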

Note 95% of this is a refactor of good code others wrote, thanks to all those others!

TODO:

- check code-coverage and add tests
- add performance tests
- think about constructors
- see whether ObjectIdDict2 can be made compatible with ObjectIdDict

TODO irrespective of this PR:

- currently there is an inconsistency in key conversion between get
  and getindex (some do an explicit conversion, others don't). (see below for discussion)
- at the moment the top-level abstract type is called Associative.
  Maybe Dictionary would be more consistent?
- equality: == when keys and values are equal, or also when the type of
  dict is equal? ObjectIdDict seems to get special treatment at the moment.
- could the standard Dict be made as performant as ObjectIdDict,
  i.e. 25% faster? (almost there now)

(edited 24 June)

StefanKarpinski · 2014-06-21T15:43:39Z

Very cool!

kmsquire · 2014-06-21T17:13:48Z

Hi Mauro, thanks for taking this on! I'll add a few comments inline.

kmsquire · 2014-06-21T17:15:36Z

base/dict.jl

+const ISEMPTY = 0x0
+const ISFILLED = 0x1
+const ISMISSING = 0x2
+


Consider just using EMPTY, FILLED, MISSING. Most other uses of is* in Julia are predicates.

(To prevent leakage of common names, you could put this file in its own module.)
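
A minimal sketch of that suggestion, in current Julia syntax; the module name SlotStates is hypothetical and only the three constants come from the diff above:

```julia
# Hypothetical module wrapping the slot-state constants so that short
# names such as EMPTY do not leak into the enclosing namespace.
module SlotStates
const EMPTY   = 0x0
const FILLED  = 0x1
const MISSING = 0x2
end

# Outside the module the constants are reached by qualified name.
state = SlotStates.FILLED
```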

kmsquire · 2014-06-21T17:40:58Z

This looks quite nice, @mauro3! I don't have any substantive comments.

mauro3 · 2014-06-21T19:12:31Z

Thanks Stefan and Kevin for the encouraging comments. I changed those ISEMPTY. (I also did some minor edits in my description above).

JeffBezanson · 2014-06-21T20:34:58Z

base/dict.jl

+    end
+end
+
+function deserialize{K,V}(s, T::Type{Associative{K,V}})


I don't think this will work, due to invariance. If T is Dict{K,V}, this method does not match.
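
The invariance issue can be reproduced on its own. This is a sketch in current Julia syntax, where Associative has since been renamed AbstractDict; the function names exact and anysub are hypothetical:

```julia
# Invariant signature: matches only the abstract type object itself.
exact(::Type{AbstractDict{K,V}}) where {K,V} = (K, V)

# Subtype-constrained signature: matches any concrete or abstract subtype.
anysub(::Type{T}) where {K,V,T<:AbstractDict{K,V}} = (K, V)

# Dict{String,Int} is a subtype of AbstractDict{String,Int} ...
is_subtype = Dict{String,Int} <: AbstractDict{String,Int}
# ... but Type{Dict{String,Int}} is not a subtype of Type{AbstractDict{String,Int}}.
type_is_subtype = Type{Dict{String,Int}} <: Type{AbstractDict{String,Int}}

exact_matches_dict = applicable(exact, Dict{String,Int})
exact_matches_abstract = applicable(exact, AbstractDict{String,Int})
params = anysub(Dict{String,Int})
```

Here exact has no method for Dict{String,Int}, while anysub accepts it and recovers the key and value types.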

JeffBezanson · 2014-06-21T20:38:19Z

How is performance compared to the current Dict? We have a bit of a performance regression in Dicts in 0.3, which I am very concerned about.

mauro3 · 2014-06-21T21:11:01Z

Performance of Dict seems to be the same as before, maybe even a bit better. The results so far are from tests based on an adaptation of the current test/dict.jl and tests Kevin put together a wee while ago. Here is the script: https://gist.github.com/mauro3/00ff8016bf909de3f4d1

(I'll put proper tests together which can go into test/perf next week and report.)

Dict and ObjectIdDict are the originals, TOREP* are the new ones. For each type, the first line is the timing of the test/dict.jl adaptation, the second line is the timing of Kevin's tests:

45.7 Dict{K,V}
499.8 Dict{K,V}

26.4 ObjectIdDict
334.0 ObjectIdDict

38.5 TOREPDict{K,V}
500.2 TOREPDict{K,V}

35.2 TOREPObjectIdDict{K,V}
435.3 TOREPObjectIdDict{K,V}

The new Dict looks fine, but we should definitely keep the old ObjectIdDict.

JeffBezanson · 2014-06-21T21:26:38Z

The explicit conversion question is interesting. If it's the case that the input and output of convert are always equal, then of course it's not necessary, but my impression is that requiring that would be too strict. It should be enough to convert and check only on assignment, which ensures that doing a lookup with the same object as key will work.

JeffBezanson · 2014-06-21T21:39:32Z

Since get! might assign new keys, the easiest thing for it to do is convert first. It could look up the key without conversion first to see if it's found, but if the key's not found it would have to convert and hash it again.

JeffBezanson · 2014-06-23T04:52:04Z

What would you attribute the slight performance increase to? (from Dict to TOREPDict)
If anything the new version seems to do slightly more operations. Or is the performance difference not statistically significant?

mauro3 · 2014-06-23T12:15:22Z

I put some performance tests together in test/perf/dict/perf.jl and ran them for this PR, upstream and 0.2.1: https://gist.github.com/mauro3/daddb89c29b7f5e5cd59

Performance of this PR vs upstream is the same within error. I think that is because the extra functions in this PR are no-ops for Dicts and thus are probably optimized away.

Somewhat puzzling/interesting is the performance difference between ObjectIdDict2 vs Dict (ObjectIdDict2 is identical to Dict except for the hashindex function): the "_unitt" test runs 30% faster, the "_del" deletion test 15% faster.

OrderedDict is as fast as Dict. Weak-dicts are slower in insertion and iteration. I'll see what this is about.

Upstream vs 0.2.1: about the same except for the _unitt and iteration test, where there is about a 20% performance regression.

mauro3 · 2014-06-23T21:43:25Z

@JeffBezanson: yes, you're right about key conversion because isequal implies the same hash, which I was not aware of: http://docs.julialang.org/en/latest/stdlib/base/?highlight=concrete#Base.hash.
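
To illustrate why converting only on assignment is enough, here is a small sketch in current Julia (not the 0.3 build discussed in this thread). It relies on the documented contract that isequal(x, y) implies hash(x) == hash(y):

```julia
d = Dict{Float64,String}()
d[1] = "one"                 # assignment converts the Int key 1 to 1.0

stored_key_type = typeof(first(keys(d)))

# The unconverted and converted keys are isequal and hash identically,
# so a lookup with the original object lands in the same slot.
same_key = isequal(1, 1.0) && hash(1) == hash(1.0)

by_int = d[1]                # lookup without conversion
by_uint8 = get(d, 0x01, "none")
```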

mauro3 · 2014-06-24T14:44:35Z

Some random info about the not-key-conversion in the getters in upstream Dict:

julia> d = Dict{Symbol, Int}()
Dict{Symbol,Int64} with 0 entries

julia> get(d, 84, 8)
8

julia> get!(d, 84, 8)
ERROR: no method convert(Type{Symbol}, Int64)
 in get! at dict.jl:564

julia> get(d, :t, :a)
:a

julia> get!(d, :t, :a)
ERROR: no method convert(Type{Int64}, Symbol)
 in get! at dict.jl:573

julia> d[6]
ERROR: key not found: 6
 in getindex at dict.jl:615

So, get does not throw any errors when the key or default is of the wrong type, whereas get! does. Also, getindex throws a key not found error and not an ERROR: no method convert(Type{Symbol}, Int64).

I guess this is fine either way. In fact, base relies on the current behavior and it took me ages to figure out why adding key-conversion to the getters leads to a build error.

mauro3 · 2014-06-24T16:36:44Z

I ran some more performance tests, now using Ints as keys as well as ASCIIStrings. The Int-keyed runs are much faster, making Dict faster than ObjectIdDict. Profiling shows that the strings' isequal is the culprit. Here are the results:
https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-dict-perf-str-int

Note iteration: Dict{Int} is 10x faster than Dict{ASCIIString}, which is again 2x faster than ObjectIdDict.

Based on these tests I made the isequal function replaceable too. Now my ObjectIdDict2 uses === for comparison (which is in fact also correct!). Now performance of my ObjectIdDict2{ASCIIString,Int} is almost on par with ObjectIdDict:
https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-obid-dicts

JeffBezanson · 2014-06-24T20:53:24Z

Iterating over an ObjectIdDict is way too slow. I can fix that.

mauro3 · 2014-06-25T13:20:30Z

Here is the new performance test for @JeffBezanson's updated ObjectIdDict: https://gist.github.com/mauro3/daddb89c29b7f5e5cd59#file-jjeffs-new-objectiddict

Much faster now. Looping over an (ASCIIString, Int) dict is about as fast as with the normal Dict; over (Int, Int) it is still about 10x slower. I think this is because type inference does not work so well for ObjectIdDict, as it is not parameterized on the types of the keys and values.
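
The effect of the missing type parameters can be seen from the element types alone. A small sketch in current Julia, using Dict{Any,Any} as a stand-in for an unparameterized dict (an assumption made for illustration; it is not the ObjectIdDict implementation measured here):

```julia
typed   = Dict{Int,Int}(i => 2i for i in 1:3)
untyped = Dict{Any,Any}(i => 2i for i in 1:3)

# With key/value parameters the element type of iteration is concrete;
# without them every element is only known to be a Pair{Any,Any}.
typed_eltype = eltype(typed)
untyped_eltype = eltype(untyped)

# The same loop works on both, but only in the typed case does the
# compiler know that `v` is an Int.
function sumvals(d)
    s = 0
    for (k, v) in d
        s += v
    end
    return s
end

typed_sum = sumvals(typed)
untyped_sum = sumvals(untyped)
```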

mauro3 · 2014-09-30T22:04:22Z

Any interest in this? Should I rebase it?

JeffBezanson · 2014-12-08T20:27:35Z

We can probably get rid of the macro now that it's possible to define constructors for types with any number of unspecified parameters. We can have Dict{K,V,Hash,Weak,Order} and hide the last 3 parameters.
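
One possible reading of that suggestion, as a sketch in current Julia syntax. All names here (SketchDict, the tag types, isordered) are hypothetical, and the storage field is a placeholder rather than a real hash table:

```julia
# Hypothetical tag types standing in for the Hash, Weak and Order parameters.
struct ValueHash end
struct IdHash end
struct StrongKeys end
struct WeakKeys end
struct Unordered end
struct Ordered end

struct SketchDict{K,V,H,W,O}
    pairs::Vector{Pair{K,V}}   # placeholder storage only
end

# Constructor on the partially applied type: callers write SketchDict{K,V}()
# and never see the last three parameters.
SketchDict{K,V}() where {K,V} =
    SketchDict{K,V,ValueHash,StrongKeys,Unordered}(Pair{K,V}[])

# Variants become aliases that fix the hidden parameters differently.
const SketchOrderedDict{K,V} = SketchDict{K,V,ValueHash,StrongKeys,Ordered}

# Behaviour is selected by dispatch on the hidden parameters.
isordered(::SketchDict) = false
isordered(::SketchDict{K,V,H,W,Ordered}) where {K,V,H,W} = true

plain = SketchDict{String,Int}()
ordered = SketchOrderedDict{String,Int}(Pair{String,Int}[])
```

With this layout no macro is needed to stamp out the concrete types; each variant is one alias plus whatever hook methods it specializes.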

Needs to be updated for the new style of Dict constructors/literals.

This also looks like a good opportunity to implement #9028 (comment) . Making a closure and/or finalizer for each element of a weak Dict can't be good for performance.

mauro3 · 2014-12-08T20:41:12Z

Cool, I'll have a look at it. Probably will be after Christmas though.

mauro3 · 2015-01-08T15:33:11Z

I had a look at issue #8712 to figure out what "now that it's possible to define constructors for types with any number of unspecified parameters" means but could not figure it out. Waiting for documentation: #9680

hayd · 2015-05-27T06:01:17Z

How does this sit with/after #10116? Presumably this'll look different after that...

Perhaps some of these perf tests are useful for #10116?

mauro3 · 2015-06-28T22:17:45Z

Closing. See #10116. Performance test I ran over the ordered dict: #10116 (comment)

mauro3 added 2 commits June 23, 2014 12:38

Refactored dict.jl

e925b26

Added performance tests.

e73799d

Fixed invariance problem in deserialize and convert. However, not sure how to do convert in general.

eeb1821

Updated dict/perf.jl, needs some deleting before merge.

ede4b2e

Now dispatching on isequal as well. And fixed a bug in ObjectIdDict2.

48867ce

jiahao force-pushed the master branch 3 times, most recently from 6c7c7e3 to 1a4c02f Compare October 11, 2014 22:06

jiahao force-pushed the master branch from cdde4df to 7fdc860 Compare October 28, 2014 04:20

MikeInnes force-pushed the master branch from 5c60996 to b1c3df3 Compare November 14, 2014 17:07

mauro3 mentioned this pull request May 31, 2015

WIP: try ordered Dict representation #10116

Closed

mauro3 closed this Jun 28, 2015

mauro3 mentioned this pull request Jul 23, 2015

custom hashing is too easy to accidentally break #12198

Open

mauro3 mentioned this pull request Nov 23, 2017

WIP: remove fallback hash method. fixes #12198 #24354

Closed

mauro3 mentioned this pull request Dec 5, 2017

WIP/RFC: Adding EgalDict{K,V} (aka ObjectIdDict{K,V}) #24932

Closed
