Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Implement operators on Nullable with lifting semantics #16988

Closed
wants to merge 2 commits into from

Conversation

nalimilan
Copy link
Member

This moves to Base operators which were defined in NullableArrays. It keeps the lifting semantics: Nullable(x) $op Nullable(y) = Nullable(x $op y) when not null, and Nullable{promote_op{...}() when null. It generalizes these definitions in a safe yet efficient way by introducing a new null_safe_op(f, ::Type...) function for a fast-path with unchecked types (see commit message).

See JuliaStats/NullableArrays.jl#111 for discussion of the design. I've marked it as 0.5.0 since this really sounds like a blocker for progress on the data management front, and it shouldn't break anything.

The first commit is a quick hack to get the new tests to pass. Not sure that's the best strategy.

Cc: @johnmyleswhite @davidagold @quinnj @tshort

Error was triggered e.g. by false >> 0x01.
@nalimilan nalimilan added this to the 0.5.0 milestone Jun 17, 2016
@JeffBezanson JeffBezanson removed this from the 0.5.0 milestone Jun 17, 2016
@nalimilan
Copy link
Member Author

@JeffBezanson Care to discuss the milestone? That's a relatively limited change, but essential to make Nullable useful in real applications.

@JeffBezanson
Copy link
Sponsor Member

We have been doing weekly triage, and are trying very hard to work through the milestone issues. I think it's too late to add new features. Why will it make a big difference to have this in Base now?

@nalimilan
Copy link
Member Author

I know it's late, but there's been a push recently to start using Nullable with DataFrames and to improve their support (which is quite limited). Having to wait for 0.6 wouldn't help reaching that goal.

I guess we could keep these operators in NullableArrays, but they really don't belong there as they aren't specific to that type at all. For example, I also need them in CategoricalArrays.jl. Also, Base currently defines ==(::Nullable, ::Nullable) as an error, and we would need to overwrite it to make it useful.

@quinnj
Copy link
Member

quinnj commented Jun 17, 2016

+1 to including in 0.5 (or at least 0.5.x). We really are trying to make a big push to get rid of DataArrays and make DataFrames a stronger standard of type-niceness.

@davidagold
Copy link
Contributor

Might a middle road be to host them in a NullableMath package for now?

@StefanKarpinski
Copy link
Sponsor Member

How about a Nullables package, required be NullableArrays and CategoricalArrays, etc. FWIW, I think that having most of the Nullable machinery outside of Base Julia for the 0.5 cycle will help accelerate the development of nullables and everything that depends on them.

@nalimilan
Copy link
Member Author

That could work. I would say it depends on the strength of the consensus around these operators: if we all agree on the semantics, I'd say they could go into 0.5; if there's still some debate, better keep them in a package.

@JeffBezanson
Copy link
Sponsor Member

I do have some objections to the content here.

  • I don't want to officially sanction computing on invalid data. If there were an efficient way to give an error when accessing an uninitialized Int, we'd be doing that.
  • It seems strange to be concerned with SIMD performance on nullable values. Do we have performance numbers? I would think if you're at the point where you need SIMD, you have to eliminate nullables anyway.
  • Needing to define every function on Nullable strikes me as the vectorization issue all over again. I prefer doing lift(+), which can be implemented with a single definition. There is also a Nullables-as-containers proposal out there, which could potentially use the f.() syntax, but I haven't yet looked at that in detail.

@StefanKarpinski
Copy link
Sponsor Member

Is the part of this change that affects base/bool.jl separable? What's the motivation for that part?

@nalimilan
Copy link
Member Author

@JeffBezanson OK, if we still have to discuss fundamental issues like this, let's go with a package. That said, the discussion is interesting, so:

I don't want to officially sanction computing on invalid data. If there were an efficient way to give an error when accessing an uninitialized Int, we'd be doing that.

Well, yeah, but the day such a feature is introduced, the instruction set will probably provide a way of combining it with vectorization (i.e. ignore errors). Anyway, that's quite theoretical, and that optimization is an implementation detail which can be abandoned.

It seems strange to be concerned with SIMD performance on nullable values. Do we have performance numbers? I would think if you're at the point where you need SIMD, you have to eliminate nullables anyway.

Unfortunately, SIMD doesn't work at this point in my tests. Cf. JuliaStats/NullableArrays.jl#111 (comment). But enabling SIMD in the long-term is essential for data analysis with NullableArrays. @davidagold and @johnmyleswhite have done a lot of work to ensure the efficiency of that structure.

Needing to define every function on Nullable strikes me as the vectorization issue all over again. I prefer doing lift(+), which can be implemented with a single definition. There is also a Nullables-as-containers proposal out there, which could potentially use the f.() syntax, but I haven't yet looked at that in detail.

The apparent consensus among JuliaStats people seems to be that we should follow C# in implementing lifting for all standard operators, and use an explicit syntax for functions (a macro? the f.(x) syntax?). So this PR wouldn't have to be extended to other functions.


Is the part of this change that affects base/bool.jl separable? What's the motivation for that part?

@StefanKarpinski The first commit is a quick hack to make tests pass in corner cases without moving Bool to a specific block. But that's really peripheral. The changes included in the second commit are the cleanest way I could find to implement comparison operators; I could as well treat them as a special case and remove the promote_op method, but I figured it was natural to have it.

Use fast path without a branch for types with unchecked arithmetic
(for which the operation can be computed even when value is missing)
and a slow path for other types. The new null_safe_op() function
allows custom types to opt-in to the fast path when possible.
Also use this strategy in isequal(), which keeps its current
(non-lifting) behavior.
@johnmyleswhite
Copy link
Member

I would personally prefer not trying to land these things for 0.5 to give us more time to consider them.

@nalimilan
Copy link
Member Author

nalimilan commented Jun 17, 2016

I'm fine with that, but it looks like we still need a few changes in Base:

  • The promote_op for Bool: promote_op() fixes #16995
  • The bitshift operators on Bool fixes (less important, but still good to fix): Fix bitshift operators on Bool #16996
  • ==(::Nullable, ::Nullable): let's remove it as it's always an error anyway?
  • Less important, but isequal could be made faster using the null_safe_op mechanism. Do we care at this point? That's probably not a performance-critical function.

@nalimilan nalimilan mentioned this pull request Jun 17, 2016
@StefanKarpinski
Copy link
Sponsor Member

Those changes all seem good except the last one, which seems like a pure performance improvement, which means it's not essential to get it into the release.

@JeffBezanson
Copy link
Sponsor Member

If you remove == for Nullable, keep in mind it falls back to ===, so you will not get an error.

@nalimilan
Copy link
Member Author

If you remove == for Nullable, keep in mind it falls back to ===, so you will not get an error.

Yes, that's what I just realized. That sounds dangerous, since loading the package would change the behaviour of that function in a subtle way, possibly breaking code that relies on it. Maybe better keep it so that it always fails, and overwrite it from the package. That's not very nice because a warning is going to be printed all the time, but...

@nalimilan
Copy link
Member Author

Should I make a PR against NullableArrays or create a new package? Maybe better keep everything in one place until these are moved to Base. Else we need to handle deprecations in NullableArrays, which is a pain.

@nalimilan
Copy link
Member Author

I've moved this to JuliaStats/NullableArrays.jl#119 for now. Closing, but please continue discussing any point you think needs more consideration. Better do that sooner than later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
missing data Base.missing and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants