Order of dict can change after serialization (make Dict ordered?) #34265
Comments
A vaguely justifiable (from what we normally do when rehashing)
But I don't know that this will always be right -- it depends on the dictionary's history of adds and deletes, I guess, as well as how many times it has been resized. Actually, I guess add and delete patterns might also screw things up even in the solution I posted in the OP. |
My thinking was that it should be rather an update of serialization of |
What's the reason why one would care about this? Some caching optimization (tokens) that gets invalidated? |
For my purposes it is mostly just counterintuitive. Silly example (but not so different from something that actually messed me up in Python).
|
The reason to have it is reproducibility. As I have discussed with @pszufe, a user can reasonably expect that in:
you should be able to get the same results in both It would be even more common if someone uses https://github.com/ChrisRackauckas/ParallelDataTransfer.jl. |
Oh wow. Why don't we serialize and deserialize
@oxinabox unfortunately with insertions and deletions it means that even with a given number of slots there is some freedom left in the ordering, so that won't quite work. :( |
We could make this backwards-compatible for reading using a trick like this: save the negative of the length, then the sizehint size. If the first number is negative we know we have the new format. |
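The sign trick described above can be sketched as follows. This is a minimal illustration in Python, not Julia's actual serialization format: the byte layout, the function names `write_old`, `write_new` and `read_any`, and the use of 64-bit integers for keys and values are all assumptions made for the example. One detail the sketch has to decide on its own: an empty dict has length 0, which has no negative, so it is written in the old layout (ordering is irrelevant there anyway).

```python
import io
import struct

def write_old(stream, items):
    """Old layout (hypothetical): length, then key/value pairs."""
    stream.write(struct.pack("<q", len(items)))
    for k, v in items:
        stream.write(struct.pack("<qq", k, v))

def write_new(stream, items, nslots):
    """New layout (hypothetical): negative length, slot count, then pairs."""
    if not items:
        # 0 cannot be negated, so an empty dict keeps the old layout.
        write_old(stream, items)
        return
    stream.write(struct.pack("<q", -len(items)))
    stream.write(struct.pack("<q", nslots))
    for k, v in items:
        stream.write(struct.pack("<qq", k, v))

def read_any(stream):
    """Read either layout; the sign of the first number tells them apart."""
    (n,) = struct.unpack("<q", stream.read(8))
    nslots = None  # unknown for data written in the old layout
    if n < 0:
        n = -n
        (nslots,) = struct.unpack("<q", stream.read(8))
    items = [struct.unpack("<qq", stream.read(16)) for _ in range(n)]
    return items, nslots

old_buf = io.BytesIO()
write_old(old_buf, [(33, 1), (2, 2)])
old_buf.seek(0)
old_items, old_slots = read_any(old_buf)

new_buf = io.BytesIO()
write_new(new_buf, [(33, 1), (2, 2)], 64)
new_buf.seek(0)
new_items, new_slots = read_any(new_buf)
```

A reader written this way accepts both old and new data, but an old reader would still choke on new data, so the trick only buys backwards compatibility for reading.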
Or we could just make Dicts ordered. |
We'd better follow JavaScript semantics then. You know, for consistency.
|
But it's an interesting point, the default |
Unfortunately we can't just save the fields of the |
That's a good point - thanks. In that case - does anyone know if there are some good benchmarks for OrderedCollections.jl in comparison to |
We did a bunch of work on it in #10116. Iteration is way faster of course (and the change might be worth it just for that). Some other workloads are a bit faster, and IIRC the main cost is that some deletion-heavy workloads are slower. |
I'll rename this to reflect the fact that it's equivalent to making Dict ordered, which there is a PR for but no issue AFAICT. |
@JeffBezanson Can we just make Dict an OrderedDict for Julia 1.6, and be done with it? Or at least add it as an extra, however we do it: either copy it (your, by now modified, code) as a stdlib, or the full OrderedCollections.jl, which also has an alternative ordered:
My longer argument here: #37761 (comment) I realize there are not many days left of the month, and Julia 1.6 is due this month. Can we at least change the implementation and merge to get packageeval, and see, it may actually be faster, even for Julia itself. We could always revert before the end of the month. |
I did make a change to Julia (locally, PR was forthcoming, unless people are not interested) for Dict->OrderedDict. I didn't mean to sound pushy, I just thought, and still think we might want to do a packageeval. I see however immediately that my change made |
At least I would be interested in seeing a version of Julia using OrderedDict.
I never had any luck making OrderedDict faster for insert/delete heavy workloads. If ordered hash maps are inherently slower at insertion/deletion then I'd say Julia might want actually both ordered and unordered versions built in. The ordered one being the default dictionary in |
That sounds more like an error in the methodology. I highly doubt that loading e.g. OhMyREPL would be an incredible 2x slower just because of this change to dictionaries, so it is likely something else is going on. |
Yes, I suspected something about precompiling missing, but I did try using the other package first, and to see if it was only the first using, and I did actually get 2x. I guess I still shouldn't rule out that theory, with only your package hitting some codepath for dicts. Let's say the slowdown was limited to 6% for |
Should I push my code as an early ("[WIP]") with the slowdown I get? Instead I was exploring other possibly faster options, and those are currently failing.* I probably should recover the working version, while I remember the changes I made before recompiling again.
Yes, then it's less important if there is a slowdown. I was just trying to do the simplest thing at first, one version replacing the other, also to see the effect. AND only comparing to the status quo, the current unordered Dict, which is probably outdated: SwissDict is available, based on Google's:
also Ordered and regular RobinDict, but I wasn't sure about adding more recent code (less tested?). |
Regardless of what data structure gets the name In ordered dicts, since equality testing depends on order, In unordered dicts, since iteration order is undefined, What goes wrong is the "compromise" policy where iteration order is guaranteed but equality does not consider it. Equality should imply that every guaranteed part of the public interface is equal between the two dicts, but if iteration order is not considered then it's possible to have Python 3.7+ |
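The Python 3.7+ behaviour referred to above can be reproduced directly. This is a small sketch using only the standard library; the variable names are made up for the example. Plain `dict` guarantees insertion order for iteration but ignores it for equality, while `collections.OrderedDict` compares order-sensitively against another `OrderedDict`.

```python
from collections import OrderedDict

# Same contents, inserted in a different order.
a = {"x": 1, "y": 2}
b = {"y": 2, "x": 1}

# Plain dicts: equal, yet a guaranteed part of the interface differs.
dicts_equal = (a == b)
dicts_same_iteration = (list(a) == list(b))

# OrderedDicts built from them: equality takes order into account.
oa = OrderedDict(a)
ob = OrderedDict(b)
ordered_equal = (oa == ob)

print(dicts_equal, dicts_same_iteration, ordered_equal)
```

So with plain dicts `a == b` holds while `list(a) == list(b)` does not, which is the "compromise" policy the comment objects to.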
@bkamins pointed out on slack:
But it doesn't have to be this way.
If we redefine things so it remembers how many slots it should have,
then it comes out the same as it came in.
But this is annoying because it changes the serialization format.
I would rather change sizehint!(::Dict) or how we call it.
The problem is that sizehint!(Dict(), 26) gives it 32 slots, but the d had 64 slots.

In Python this was one of the things that really caught me out, because Python salts its hashes with a random salt selected each time it starts.
But Julia doesn't.
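Why the slot count matters for ordering can be seen with a toy table. The sketch below is in Python and is not Julia's Dict: `ToyTable` is a hypothetical open-addressing table with linear probing that uses the integer key itself as its hash, has no resizing, and assumes it is never filled completely. The slot counts 32 and 64 are taken from the description above; the keys are chosen for illustration.

```python
class ToyTable:
    """Toy open-addressing hash set with linear probing (illustration only)."""

    def __init__(self, nslots):
        self.slots = [None] * nslots

    def insert(self, key):
        # Toy hash: the integer key itself, reduced modulo the slot count.
        i = key % len(self.slots)
        # Probe forward until a free slot or the same key is found.
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = key

    def keys(self):
        # Iteration order is slot order, as in an unordered hash table.
        return [k for k in self.slots if k is not None]


def build(nslots, keys):
    table = ToyTable(nslots)
    for k in keys:
        table.insert(k)
    return table


original = build(64, [33, 2])            # stands in for d with 64 slots
rebuilt_32 = build(32, original.keys())  # "deserialized" into 32 slots
rebuilt_64 = build(64, original.keys())  # "deserialized" into 64 slots

print(original.keys(), rebuilt_32.keys(), rebuilt_64.keys())
```

In this toy, rebuilding with the original slot count reproduces the order and rebuilding with fewer slots does not. As noted elsewhere in the thread, a real table with a history of insertions and deletions leaves more freedom in the ordering than the slot count alone pins down.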