Order of dict can change after serialization (make Dict ordered?) #34265

oxinabox · 2020-01-05T15:07:41Z

@bkamins pointed out on slack:

julia> using Serialization;
julia> d = Dict{Symbol, Vector{Int}}(Symbol.('a':'z') .=> Ref([1]));
julia> serialize("test.bin", d);
julia> d2 = deserialize("test.bin");
julia> hcat(collect.(keys.([d, d2]))...)
26×2 Array{Symbol,2}:
 :o  :j
 :b  :x
 :p  :d
 :n  :k
 :j  :g
 :e  :u
 :c  :r
 :h  :a
 :l  :m
 :w  :y
 :x  :i
 :d  :o
 :k  :b
 :s  :p
 :v  :n
 :g  :e
 :u  :c
 :q  :h
 :r  :l
 :z  :w
 :a  :s
 :f  :v
 :m  :z
 :y  :q
 :i  :f
 :t  :t

But it doesn't have to be this way.

If we redefine things so it remembers how many slots it should have,
then it comes out the same as it came in.


hintsize(dict::AbstractDict) = length(dict)
hintsize(dict::Dict) = length(dict.keys)

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    sz = read(s.io, Int32)
    t = T();
    sizehint!(t, sz)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

function Serialization.serialize_dict_data(s::AbstractSerializer, d::AbstractDict)
    write(s.io, Int32(length(d)))
    write(s.io, Int32(length(d.slots)))
    for (k,v) in d
        serialize(s, k)
        serialize(s, v)
    end
end

But this is annoying because it changes the serialization format.
I would rather change sizehint!(::Dict) or how we call it.
The problem is that sizehint!(Dict(), 26) gives it 32 slots,
but the d had 64 slots.

In python this was one of thing things that really caught me out.
Because python salts its hashes wiith a random salt selected each time it starts.
But julia doesn't.

The text was updated successfully, but these errors were encountered:

oxinabox · 2020-01-05T15:14:29Z

A vaguely justifyable (from what we normally do when rehashing)
is

function Serialization.deserialize_dict(s::AbstractSerializer, T::Type{<:AbstractDict})
    n = read(s.io, Int32)
    t = T();
    sizehint!(t, n > 64000 ? n : n*2)
    Serialization.deserialize_cycle(s, t)
    for i = 1:n
        k = deserialize(s)
        v = deserialize(s)
        t[k] = v
    end
    return t
end

But I don't know that this will always be right -- it depends on the dictionaries history of adds and deletes I guess as how many times it has been resized.

Actually, I guess add and delete paterns might also screw up things even in the solution I posted in the OP.
But maybe we can solve those patterns by rehash!ing before serialization?

bkamins · 2020-01-05T18:11:57Z

My thinking was that it should be rather an update of serialization of Dict, if we decided to change this (other AbstractDicts might have different fields), is just to serialize and deserialize its fields verbatim. This is slightly less efficient but guarantees reproducibility.

KristofferC · 2020-01-05T19:26:09Z

What's the reason why one would care about this? Some caching optimization (tokens) that gets invalidated?

oxinabox · 2020-01-05T19:47:39Z

For my purposes is mostly just counter intuitive.
Its fine for the keys to be in arbitary order.
But i expect them to stay the same across operations like serialization.

Silly example (but not so different from something that actually messed me up in python).

results::Dict{String, Float64}

function convert_to_percent!(res::Dict{String, Float64},)
    for model in keys(res)
        res[model]*=100
    end
    return res
end

function znormalize!(res::Dict{String, Float64},)
    mean_ = mean(values(res))
    std_ = std(values(res))
    for model in keys(res)
        res[model] = (res[model]-mean_)/std_
    end
    return res
end

percent, zscore = pmap([convert_to_percent!, znormalize]) do f
    f(results)
end

df = DataFrame()
df.models = keys(results)
df.scores = values(results)
df.percent = values(percent)
df.zscore = values(zscore)

bkamins · 2020-01-05T20:17:12Z

The reason to have it is reproducibility. A I have discussed with @pszufe a user can reasonably expect that in:

julia> using Distributed

julia> addprocs(1)
1-element Array{Int64,1}:
 2

julia> @everywhere using Distributed

julia> @everywhere using Random

julia> @everywhere Random.seed!(1234)

julia> d = Dict(Symbol.('a':'z') .=> 1:26);

julia> pmap(x -> myid() => rand(x), [d])
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 2 => (:c => 3)

julia> pmap(x -> myid() => rand(x), [d], distributed=false)
1-element Array{Pair{Int64,Pair{Symbol,Int64}},1}:
 1 => (:u => 21)

you should be able to get the same results in both pmap calls. Otherwise the result depends on where you execute the task to be performed (which preferably should not be the case).

It would even more common if someone uses https://github.com/ChrisRackauckas/ParallelDataTransfer.jl.

andyferris · 2020-01-06T06:21:29Z

Oh wow. Why don't we serialize and deserialize Dict as a simple struct containing arrays? It would seem to be the simplest solution. (But then again, I just realized that I don't know much about serialization in Julia... are there any problems with this?)

If we redefine things so it remembers how many slots it should have, then it comes out the same as it came in.

@oxinabox unfortunately with insertions and deletions it means that even with a given number of slots there is some freedom left in the ordering, so that won't quite work. :(

JeffBezanson · 2020-01-06T06:21:42Z

We could make this backwards-compatible for reading using a trick like this: save the negative of the length, then the sizehint size. If the first number is negative we know we have the new format.

JeffBezanson · 2020-01-06T06:23:09Z

Or we could just make Dicts ordered.

andyferris · 2020-01-06T06:26:11Z

Or we could just make Dicts ordered.

We'd better follow JavaScript semantics then. You know, for consistency.

The iteration order for objects follows a certain set of rules since ES2015, but it does not (always) follow the insertion order. Simply put, the iteration order is a combination of the insertion order for strings keys, and ascending order for number-like keys:
// key order: 1, foo, bar
const obj = { "foo": "foo", "1": "1", "bar": "bar" }

andyferris · 2020-01-06T06:28:05Z

But it's an interesting point, the default Base dictionary could be ordered, it could be useful for supporting sorting of dictionary keys or values out-of-the box, etc.

JeffBezanson · 2020-01-06T06:29:41Z

Unfortunately we can't just save the fields of the Dict struct verbatim, since some hashes might depend on address --- rehashing has to happen on the other machine anyway. So it seems to me that an ordered dict is strictly required to get this property.

andyferris · 2020-01-06T06:36:07Z

since some hashes might depend on address

That's a good point - thanks.

In that case - does anyone know if there some good benchmarks for OrderedCollections.jl in comparison to Dict/Set? It would be interesting to know the performance characteristics.

JeffBezanson · 2020-01-06T06:41:59Z

We did a bunch of work on it in #10116. Iteration is way faster of course (and the change might be worth it just for that). Some other workloads are a bit faster, and IIRC the main cost is that some deletion-heavy workloads are slower.

kmsquire · 2020-01-06T08:54:51Z

Just to add, the implementation in OrderedCollections was derived from #10116 (although the Dict implementation which #10116 was based off of has diverged some since then.)

JeffBezanson · 2020-01-07T19:14:02Z

I'll rename this to reflect the fact that it's equivalent to making Dict ordered, which there is a PR for but no issue AFAICT.

PallHaraldsson · 2020-09-25T23:54:06Z

@JeffBezanson Can we just make Dict to be OrderedDict, for Julia 1.6, to be done with it? Or at least add it as an extra, however we do it, either copy it (your, by now modified code) as stdlib, or the full OrderedCollections.jl, which has also an alternative ordered:

It also implements LittleDict which is a ordered dictionary, that is much faster than any other AbstractDict (ordered or not) for small collections.

My longer argument here: #37761 (comment)

I realize there are not many days left of the month, and Julia 1.6 is due this month. Can we at least change the implementation and merge to get packageeval, and see, it may actually be faster, even for Julia itself. We could always revert before the end of the month.

PallHaraldsson · 2020-09-27T14:15:21Z

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

andyferris · 2020-09-28T04:04:14Z

I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict.

At least I would be interested in seeing a version of Julia using OrderedDict.

I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

I never had any luck making OrderedDict faster for insert/delete heavy workloads.

If ordered hash maps are inherently slower at insertion/deletion then I'd say Julia might want actually both ordered and unordered versions built in. The ordered one being the default dictionary in Base. The unordered one, being better for cache-style workloads, would be used by the compiler (since we care about latency) and if it's in Core.Compiler and of good quality (which it already is) it may as well be exported from Base as well (since users may also have cache-style workloads to optimize). One possible version of this + #37761 that follows this line of reasoning would rename Dict to UnorderedDict globally (including compiler), then add the OrderedDict code to Base but calling it Dict. If packageeval passes on that, well, that would seem very interesting, to me.

KristofferC · 2020-09-28T06:15:38Z

I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made using slower, from 6% to 2x depending on package (checked Revise and OhMyREPL)

That sounds more like an error in the methodology. I highly doubt that loading e.g. OhMyREPL would be an incredibly 2x slower just because of this change to dictionaries so it is likely something else going on.

PallHaraldsson · 2020-09-28T10:21:34Z

Yes, I suspected something about precompiling missing, but I did try using the other package first, and to see if it was only the first using, and I did actually get 2x. I guess I still shouldn't rule out that theory, with only your package hitting some codepath for dicts.

Let's say the slowdown was limited to 6% for using and dicts, would that be ok with for the ordered guarantee?

PallHaraldsson · 2020-09-28T10:44:04Z

Should I push my code as an early ("[WIP]") with the slowdown I get? Instead I was exploring other possibly faster options, and those are currently failing.* I probably should recover the working version, while I remember the changes I made before recompiling again.

I'd say Julia might want actually both ordered and unordered versions built in.

Yes, then it's less important if there is slowdown. I was just trying to to the simplest thing at first, one version, replacing the other, also to see the effect. AND only comparing to the status quo, the current unordered Dict, that's probably outdated:

SwissDict is available, based on Google's:
https://abseil.io/blog/20180927-swisstables

The “flat” Swiss tables should be your default choice.

also Ordered and regular RobinDict, but I wasn't sure about adding more recent code (less tested?).

* https://discourse.julialang.org/t/ordered-dict-in-base-is-littledict-not-thread-safe-and-possibly-the-reason-i-cant-add-it-to-base/47372

adkabo · 2022-05-24T18:50:28Z

Regardless of what data structure gets the name Dict, it is important to keep a distinction between unordered dict and ordered dict.

In ordered dicts, since equality testing depends on order, od1 == od2 implies first(od1) == first(od2), which is good.

In unordered dicts, since iteration order is undefined, ud1 == ud2 makes no promises about which element is first(ud1), so first(ud1) != first(ud2) is fine.

What goes wrong is the "compromise" policy where iteration order is guaranteed but equality does not consider it. Equality should imply that every guaranteed part of the public interface is equal between the two dicts, but if iteration order is not considered then it's possible to have cd1 == cd2 but first(cd1) != first(cd2) even though all of those operations are well defined. This seems absurd to me.

Python 3.7+ dict uses the compromise policy but I wish it didn't.

JeffBezanson changed the title ~~Deserializing a Dict gives a different order~~ make Dict ordered? Jan 7, 2020

JeffBezanson added collections Data structures holding multiple items, e.g. sets needs decision A decision on this change is needed labels Jan 7, 2020

PallHaraldsson mentioned this issue Sep 29, 2020

RFC: Replaced Dict with OrderedDict #37804

Closed

tpapp mentioned this issue Oct 23, 2020

Change Dict to be ordered by default #38145

Closed

KristofferC mentioned this issue Oct 26, 2020

Add ODict to Base for OrderedDict #38163

Closed

oxinabox changed the title ~~make Dict ordered?~~ Order of dict can change after serialization (make Dict ordered?) Nov 26, 2024

oxinabox mentioned this issue Nov 26, 2024

Remove convert method to OrderedDict JuliaCollections/OrderedCollections.jl#139

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Order of dict can change after serialization (make Dict ordered?) #34265

Order of dict can change after serialization (make Dict ordered?) #34265

oxinabox commented Jan 5, 2020

oxinabox commented Jan 5, 2020

bkamins commented Jan 5, 2020

KristofferC commented Jan 5, 2020

oxinabox commented Jan 5, 2020 •

edited

Loading

bkamins commented Jan 5, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

andyferris commented Jan 6, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

kmsquire commented Jan 6, 2020

JeffBezanson commented Jan 7, 2020

PallHaraldsson commented Sep 25, 2020 •

edited

Loading

PallHaraldsson commented Sep 27, 2020 •

edited

Loading

andyferris commented Sep 28, 2020 •

edited

Loading

KristofferC commented Sep 28, 2020

PallHaraldsson commented Sep 28, 2020 •

edited

Loading

PallHaraldsson commented Sep 28, 2020 •

edited

Loading

adkabo commented May 24, 2022 •

edited

Loading

Order of dict can change after serialization (make Dict ordered?) #34265

Order of dict can change after serialization (make Dict ordered?) #34265

Comments

oxinabox commented Jan 5, 2020

oxinabox commented Jan 5, 2020

bkamins commented Jan 5, 2020

KristofferC commented Jan 5, 2020

oxinabox commented Jan 5, 2020 • edited Loading

bkamins commented Jan 5, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

andyferris commented Jan 6, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

andyferris commented Jan 6, 2020

JeffBezanson commented Jan 6, 2020

kmsquire commented Jan 6, 2020

JeffBezanson commented Jan 7, 2020

PallHaraldsson commented Sep 25, 2020 • edited Loading

PallHaraldsson commented Sep 27, 2020 • edited Loading

andyferris commented Sep 28, 2020 • edited Loading

KristofferC commented Sep 28, 2020

PallHaraldsson commented Sep 28, 2020 • edited Loading

PallHaraldsson commented Sep 28, 2020 • edited Loading

adkabo commented May 24, 2022 • edited Loading

oxinabox commented Jan 5, 2020 •

edited

Loading

PallHaraldsson commented Sep 25, 2020 •

edited

Loading

PallHaraldsson commented Sep 27, 2020 •

edited

Loading

andyferris commented Sep 28, 2020 •

edited

Loading

PallHaraldsson commented Sep 28, 2020 •

edited

Loading

PallHaraldsson commented Sep 28, 2020 •

edited

Loading

adkabo commented May 24, 2022 •

edited

Loading