Interface for high-performance indexing operations #71
Comments
Overall, this is a great approach. One issue with this is that the result of |
Why is |
Got it. I misunderstood what you were planning to return. On Sunday, February 2, 2014, John Myles White notifications@github.com
|
To make sure everybody's on the same page here: this change would make DataArrays mutable in a potentially dangerous way, since you could modify either the underlying values array or the missingness mask without modifying the other. |
Could we do:
isna(da::DataArray, inds...) = getindex(da.na, inds...)
values(da::DataArray, inds...) = getindex(da.data, inds...)
Seems like this would be less dangerous, and also more easily adapted to deal with PooledDataArrays. |
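A minimal sketch of why the indexed accessors are less dangerous, written against a hypothetical stand-in type (MaskedVector is an assumption for illustration, not the real DataArray; only the field names data and na are taken from the definitions above). Scalar indexing returns a plain value and range indexing returns a copy, so callers never receive a reference to the internal arrays:

```julia
# Hypothetical stand-in for DataArray: a values vector plus a missingness mask.
struct MaskedVector{T}
    data::Vector{T}   # underlying values (meaningless where na[i] is true)
    na::Vector{Bool}  # true marks a missing entry
end

# Indexed accessors in the style proposed above.
# `values` already exists in Base, so it is extended explicitly here.
isna(da::MaskedVector, inds...) = getindex(da.na, inds...)
Base.values(da::MaskedVector, inds...) = getindex(da.data, inds...)

da = MaskedVector([1.0, 2.0, 3.0], [false, true, false])

mask = isna(da, 1:3)   # range indexing allocates a fresh Vector{Bool}
mask[1] = true         # mutating the copy leaves da.na untouched
```

The cost of this safety is an allocation whenever a range is requested, which is the trade-off against returning the mask by reference.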
That's a way better idea than mine. |
One of the things that troubles me about this approach (which I think is still such an improvement over our current interface that we should move forward with it regardless) is it makes getindex and setindex! less symmetric. |
I see these not as desirable parts of our API, but as stopgap solutions until Julia does a better job of handling union types. We don't have a type inference problem for |
Agreed. |
Spent some time discussing this exact issue (re. compilation of R) with Duncan Temple Lang the other day. If we manage to find a solution to nominal/ordinal variables that allows us to get rid of PDAs, it might be nice to introduce a
This could then get expanded into some trivial function call like |
In 80068de, I set up @simonster's generalized definition of |
Over time, I've come to change my mind about this issue. I now think that we will want, in the future, to implement option types as the representation of scalar values with potential missingness. We'd keep DataArrays around, but have all scalar indexing into them return wrapped option types. This would solve all of the type stability issues, although it requires moving our idioms further from R's. |
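As a rough illustration of what "scalar indexing returns a wrapped option type" could look like, here is a sketch with hypothetical names (Option and OptionVector are assumptions, not part of DataArrays). The point is that getindex always returns the same concrete type whether or not the entry is missing:

```julia
# Hypothetical option type: a value plus a flag saying whether it is missing.
struct Option{T}
    value::T
    isnull::Bool
end

# Hypothetical container with the same values-plus-mask layout as a DataArray.
struct OptionVector{T}
    data::Vector{T}
    na::Vector{Bool}
end

# Scalar indexing always returns Option{T}, never a Union of T and a
# missing-value marker, so the return type does not depend on the data.
Base.getindex(v::OptionVector{T}, i::Int) where {T} = Option{T}(v.data[i], v.na[i])

v = OptionVector([10, 20, 30], [false, true, false])
present = v[1]   # wraps 10, not missing
absent = v[2]    # flagged missing; the stored value must not be used
```

Callers have to check isnull before reading value, which is the idiom shift away from R that the comment above mentions.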
If I'm understanding you correctly, with this change we'd have to give up on passing DataArrays to functions that accept AbstractArrays. This would seem to move our idioms further from Julia's as well, and I'm not sure that's worth it. The generality is often useful despite the performance cost. An alternative approach might be to create a macro that converts |
Why would we have to give up on |
We could have I think we could probably make the most common idioms work, but other code that works fine with Arrays and currently works slowly with DataArrays would not work at all. I'm still hoping we'll get better handling of union types at the compiler level... |
I'd like to have a longer chat with the folks working on the compiler about their opinions on this topic. I feel like working with From my perspective, I think we'd want to maintain our current assumption that To elaborate on that: my thinking is that you'd want to start with the simplest possible implementation of In cases in which Stefan, Jeff and the rest are visiting my office on Friday. I'm planning to spend some time chatting with them about this then. |
This makes much sense as this would make The other possibility is to allow the compiler to mark that |
Yes, we would keep The speedup comes from code like the following:
This code has no type-instability, which makes it much easier for Julia to optimize under the assumption that The sentinel values approach used by R is interesting, but doesn't generalize to arbitrary types. It's much easier to allow people to do something like:
In cases where this is a sensible default, this lets people choose it as appropriate for their problem. Trying to invent a sensible default for all problems doesn't work. |
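A sketch of the caller-chosen default idea, under the assumption of a hypothetical Option type and a hypothetical helper named unwrap (neither name comes from the thread). The loop body only ever sees Float64 values, so there is nothing type-unstable for the compiler to work around:

```julia
# Hypothetical option type: a value plus a flag saying whether it is missing.
struct Option{T}
    value::T
    isnull::Bool
end

# Hypothetical helper: return the wrapped value, or the caller's default
# when the entry is missing. Always returns a T.
unwrap(x::Option{T}, default::T) where {T} = x.isnull ? default : x.value

# Sum a vector of options, substituting `default` for missing entries.
function total(xs::Vector{Option{Float64}}, default::Float64)
    s = 0.0
    for x in xs
        s += unwrap(x, default)
    end
    return s
end

xs = [Option(1.5, false), Option(0.0, true), Option(2.5, false)]
total(xs, 0.0)   # the missing entry contributes the default, 0.0
```

The default is passed at the call site, so each problem can pick its own value instead of relying on one global sentinel.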
@johnmyleswhite But wouldn't (I don't really agree that "sentinel values" do not generalize to arbitrary types: it's mostly a problem for basic types like integers. There are many ways of storing a NULL/NA value in more complex types, and as most of them are just composite types, one just needs to make one of its fields an |
The construction of an immutable type has almost no cost. |
AFAIK, the reason we teach people to avoid union types in parts of their code not interacting with DataArrays is that they are slow, not that there is anything inherently wrong with them. If we didn't want people to use them at all, I don't think they'd exist at all. There's a difference between what is presently fast and what can be made fast. I would suggest that we design around the latter and provide (hopefully temporary) workarounds for the former. Optimizing union types is probably a decent amount of work, but it's something that JITs for dynamic languages generally do (see polymorphic inline caching). This particular case should be reasonably simple since the possible types can still be statically inferred. I believe what is needed for good performance is 1) to pass union types of immutables on the stack or in registers instead of allocating them on the heap and 2) to emit two direct calls with a branch instead of calling via I'm not sure if we can make The Here is the set of things I think we need for indexing a DataArray to be fast:
(1) and (2) can be satisfied by either the |
As mentioned in JuliaData/DataFrames.jl#523, we might want to expose an "unsafe" interface to the underlying values of a DataArray for those trying to do high-performance work:
isna(da): Expose the NA mask by making isna return a reference. We could also implement complex indexing for isna(da, inds...), but that seems like a lot of needless work.
values(da): Expose the underlying values, which will have undefined values for any NA entries.
This would put us in a position to write code like:
We could make this code very fast because it would be perfectly type stable. As a (probably too) radical step, we could even change getindex to implement the semantics of values.
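A sketch of the kind of loop this interface would allow, written against a hypothetical stand-in type (MaskedVector and sum_skipna are assumptions for illustration; the real DataArray is not used here). Both accessors return references to plain arrays, so the loop only touches a Vector{Bool} and a Vector{Float64}:

```julia
# Hypothetical stand-in for DataArray: a values vector plus a missingness mask.
struct MaskedVector{T}
    data::Vector{T}   # underlying values (undefined where na[i] is true)
    na::Vector{Bool}  # true marks a missing entry
end

# "Unsafe" accessors: both return references to the internal arrays, not copies.
# `values` already exists in Base, so it is extended explicitly here.
isna(da::MaskedVector) = da.na
Base.values(da::MaskedVector) = da.data

# Sum the non-missing entries without ever producing an NA-typed scalar.
function sum_skipna(da::MaskedVector{Float64})
    mask = isna(da)
    vals = values(da)
    s = 0.0
    for i in eachindex(vals)
        if !mask[i]      # skip entries whose stored value is meaningless
            s += vals[i]
        end
    end
    return s
end

da = MaskedVector([1.0, 99.0, 3.0], [false, true, false])
sum_skipna(da)   # ignores the masked 99.0
```

Because the accessors return references, a caller can mutate the mask or the values independently, which is the safety concern raised in the comments.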