Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OrderedDict to Base (for Julia 1.6?) #37761

Closed
PallHaraldsson opened this issue Sep 25, 2020 · 7 comments
Closed

Add OrderedDict to Base (for Julia 1.6?) #37761

PallHaraldsson opened this issue Sep 25, 2020 · 7 comments

Comments

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Sep 25, 2020

I would like to propose OrderedDict added (I guess unexported, at least for now), to Julia 1.6. When It or later version will be LTS, and Julia 1 dropped as LTS, I feel it important it is readily available in Base, to be available in all used Julia versions from the (faster) system image.

Why? You can use DataStructures.jl, but it's a hassle to add it to each library that needs this one important container (I just fixed a bug in Pandas, that would have worked by just renaming Dict to OrderedDict had it been available, or Base.OrderedDict in this proposal), and will slow down (on my recent master):

julia> @time using DataStructures: OrderedDict
  0.558825 seconds (461.80 k allocations: 28.509 MiB, 3.54% gc time)

This seems like a small PR, that I could even do. That code is available and fully baked, and think it's still Jeff's code, code I trust.

I'm NOT proposing changing the default of DIct to OrderedDict, at least for now, as I think it's still slower, and we can't really go back with that API decision. I note however that's what Python did:

"Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations. They have become less important now that the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7).
[..]
Until Python 3.8, dict lacked a reversed() method."

So, at a minimum I propose just A.:

A.
Add OrderedDict, I think it needs to be NOT exported to avoid conflict in case DataStructures.jl will also be used.

B.
I think we could make a "standard library" named Ordered, so that:

using Ordered

will change the meaning of Dict to OrderedDict.

C.
Maybe at some point in the future we can do B. by default, and add UnorderedDict.

@tkf
Copy link
Member

tkf commented Sep 25, 2020

Why not use OrderedCollections.jl? DataStructures.jl just re-exports OrderedCollections.jl:

https://github.com/JuliaCollections/DataStructures.jl/blob/355a7a8c3016808cc682a31f1b772d1b369cafc8/src/DataStructures.jl#L10-L12

Importing OrderedCollections seems to be quite fast:

$ julia --startup-file=no -e '@time using DataStructures'
  0.155198 seconds (107.88 k allocations: 8.267 MiB)

$ julia --startup-file=no -e '@time using OrderedCollections'
  0.034274 seconds (29.91 k allocations: 2.282 MiB)

$ julia --startup-file=no -e '@show VERSION'
VERSION = v"1.6.0-DEV.1045"

@yuyichao
Copy link
Contributor

You can use DataStructures.jl, but it's a hassle to add it to each library that needs this one important container

You can apply that to basically every popular package so that's not a good argument for anything.

Unless it's absolutely needed in Base, it should not be included. OTOH, including it in the stdlib would be OK.

@KristofferC
Copy link
Member

What can happen is that Julia's dict becomes ordered at some point (#10116) but new very similar Dict types will not be put in.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Sep 25, 2020

OTOH, including it in the stdlib would be OK.

Ok, that's pretty much what I meant, include an ordered variant, just not exported (why I mentioned Base), to keep compatibility. I'm not up-to-speed, but I'm not sure there's any "the" stdlib, only stdlibS (#27197) and I think adding OrderedCollection.jl as one would be great.

ordered at some point (#10116) but new very similar Dict types will not be put in

The "at some point" argument, means we've been missing out on this container as "batteries-included" for 5+ years (I know the arguments for and against inclusion of any code in general, and specifically for and against ordered). I think it's really important to have it in, even as the default, prevents bugs, and unordered should be opt-in.

It's true I overlooked OrderedCollections.jl until I now when I started looking into locating the OrderedDict code to copy it for a PR. The speed argument of loading isn't the best, and Revise3 made loading 4x slower, however, when you look into the packages, you need to decide between 3+ alternatives: OrderedDict, OrderedRobinDict, LittleDict (SwissDict?) and for new users, to exclude false DefaultOrderedDict, OrderedSet, MultiDict (FrozenLittleDict, UnfrozenLittleDict?).

I think an argument can be made for users to have one reasonably good default (ordered) option included, then advanced users look for options, e.g. [Unordered]Dict if it's actually faster (LittleDict is often faster).

@timholy
Copy link
Member

timholy commented Sep 26, 2020

and Revise3 made loading 4x slower

Huh? Not even close.

Fresh Julia 1.6 session, Revise 2.7.6:

tim@diva:~/.julia/dev/GridLayoutBase$ time julia-master --startup-file=no -q -e 'using Revise; yield(); using Images; yield()'

real	0m7.234s
user	0m7.201s
sys	0m0.328s

Now with Revise 3.1.2:

tim@diva:~/.julia/dev/GridLayoutBase$ time julia-master --startup-file=no -q -e 'using Revise; yield(); using Images; yield()'

real	0m4.376s
user	0m4.336s
sys	0m0.316s

That's almost 2x faster, not 4x slower.

@PallHaraldsson
Copy link
Contributor Author

PallHaraldsson commented Sep 26, 2020

That's almost 2x faster, not 4x slower.

[We should follow up on this off-topic at: https://github.com/timholy/Revise.jl/issues/542]

You've comparing Revise2 to Revise3, I meant just loading without Revise vs. with Revise3 (even disregarding the startup cost of Revise itself):

$ time julia --startup-file=no -e '@time using Revise; yield(); @time using DataStructures'
  0.725897 seconds (664.35 k allocations: 45.401 MiB, 2.20% gc time)
  0.520156 seconds (461.80 k allocations: 28.509 MiB)

real	0m1,518s
user	0m1,830s
sys	0m0,565s

$ time julia --startup-file=no -e '@time using DataStructures'
  0.118311 seconds (99.83 k allocations: 7.564 MiB)

real	0m0,372s
user	0m0,723s
sys	0m0,522s

@timholy
Copy link
Member

timholy commented Sep 26, 2020

There are several issues here. I'll get back to the issue of "would it help to move OrderedDict to a stdlib?" (answer: no) in a sec, let's first tackle how you're thinking about latency.

You seem to imagine all factors are multiplicative. You're better off thinking of a model where the hit is c + m*p, where c is a one-time startup cost and m is the extra latency per package and p has something to do with the number of packages and/or quantity of code. I see your demo above, but let's also look a this one:

tim@diva:~/Info/papers$ time julia-master --startup-file=no -e 'using Flux; yield()'

real	0m11.028s
user	0m10.769s
sys	0m0.532s
tim@diva:~/Info/papers$ time julia-master --startup-file=no -e 'using Revise; yield(); using Flux; yield()'

real	0m12.416s
user	0m12.073s
sys	0m0.604s

Together (a small package like DataStructures and a large package like Flux), these two suggest that the only statement that is even remotely accurate is to say that Revise has a ~1s constant latency (c = 1s) and at most a small multiplicative hit, so small that it will take more careful measurements than this to determine what it is. Definitely, definitely not 4x.

Back to the OrderedCollections issue. You are right that much of Revise's extra latency is due to inferring and compiling Dict and OrderedDict methods. But the problem is that currently precompilation can't cache even the inference results of Dict methods (as shown in #31466) due to the issues mentioned in #32705. So the fact that Revise needs to create type that are not in Base means that this wouldn't help. The only thing that would help would be to have an ordered Dict in Base and move all of Revise's key data types to Base, so that it could be built into the system image.

But you can also do that now with PackageCompiler, and given your regular interest in latency I recommend you give it a try and see if you like it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants