Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

port to Julia 0.7 #182

Merged
merged 12 commits into from
Sep 24, 2018
Merged

port to Julia 0.7 #182

merged 12 commits into from
Sep 24, 2018

Conversation

shashi
Copy link
Collaborator

@shashi shashi commented Jul 11, 2018

Replace DataValues.jl with Union{Missing,T} and NamedTuples.jl with native named tuples.

Get all but TableTraits.jl code to pass tests.

cc @davidanthoff I have commented tabletraits.jl code for now. Are you planning to make it possible to use it on 0.7 with Union{Missing, T}?

cc @quinnj

@piever maybe _is_subtype is now kind of redundant since Union{Missing, T} can be tested with <: for the same property?

Shashi Gowda added 2 commits July 11, 2018 16:10
- Remove Nullables and DataValues
- colnames now returns a tuple
- Remove NamedTuples dependency
- Get test_core.jl to pass
@piever
Copy link
Collaborator

piever commented Jul 11, 2018

I think we should switch to Base.promote_typejoin to widen element when collecting an iterable of tuples, but @nalimilan probably knows more about this.

I would also like to get rid of is_subtype, but we need to make sure that Base.promote_typejoin(S, T) === T if and only if S <: T. If that's not the case we could even use Base.promote_typejoin(S, T) !== T as a condition to widen (meaning, widen only if it would actually get strictly wider).

@nalimilan
Copy link
Member

I think we should switch to Base.promote_typejoin to widen element when collecting an iterable of tuples, but @nalimilan probably knows more about this.

Yes, that sounds right. Basically, do the same thing as map.

I would also like to get rid of is_subtype, but we need to make sure that Base.promote_typejoin(S, T) === T if and only if S <: T. If that's not the case we could even use Base.promote_typejoin(S, T) !== T as a condition to widen (meaning, widen only if it would actually get strictly wider).

I would say that's the case. At least promote_typejoin is equivalent to typejoin except for Union{T,Missing} and Union{T,Nothing}.

src/table.jl Outdated
@@ -28,7 +27,7 @@ struct NextTable{C<:Columns} <: AbstractIndexedTable
# Cache permutations by various subsets of columns
perms::Vector{Perm}
# store what percent of the data in each column is unique
cardinality::Vector{Nullable{Float64}}
cardinality::Vector{Any}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be Union{Float64,Missing}?

@davidanthoff
Copy link
Contributor

TableTraits.jl (and the broader Queryverse.jl family of packages) will continue to use DataValue for missing values, I think the consensus in various discussions was that the combination of named tuples and Union{T,Missing} is not ready for the Query.jl use case in the julia 1.0 time frame.

I'm kind of surprised that you aren't running into some of the same issues? This is essentially still the same set of problems that we discussed in JuliaData/Missings.jl#6 and https://discourse.julialang.org/t/missing-data-and-namedtuple-compatibility/8136. Have you benchmarked this branch?

TableTraits.jl is ported to 0.7, do you want me to open a PR that reenables the TableTraits.jl integration?

.travis.yml Outdated
@@ -4,7 +4,7 @@ os:
- linux
- osx
julia:
- 0.6
- nightly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should add 0.7 and 1.0 as well.

@andreasnoack
Copy link
Member

Bump. It would be great to get this merged soon.

@JeffBezanson
Copy link
Contributor

David's performance concern here is valid --- it might be better to do everything but the Missing change first.

@shashi
Copy link
Collaborator Author

shashi commented Sep 4, 2018

@davidanthoff I think @piever's iterator based approach with widencolumns was key to making this possible with missing. He did strategically design for that with this change in mind.

@quinnj
Copy link
Member

quinnj commented Sep 4, 2018

It would certainly be helpful if some benchmarks were actually performed. @piever, do you happen to have any comparisons you did in doing the widencolumns work?

@shashi
Copy link
Collaborator Author

shashi commented Sep 4, 2018

Of course. I was just commenting on the point about not being able to use Iterators + missing.

@shashi
Copy link
Collaborator Author

shashi commented Sep 4, 2018

Ah Okay I hadn't schooled myself on that discourse discussion, my bad.

@davidanthoff
Copy link
Contributor

@piever's code is great, but I wouldn't expect it to solve the performance problems. I might be wrong, so benchmarking and comparing would be a good idea.

@piever
Copy link
Collaborator

piever commented Sep 5, 2018

At the time I mainly tested that it didn't lose performance compared to the previous inference based implementation in the type stable case: as Missing in Julia 0.6 was slow anyway I didn't test it. I suspect there may be issues in the case of multiple columns allowing missing data, with the combinatorial explosion of possible types, but I don't have benchmarks.

OTOH, I think that JuliaData/Tables.jl#10 is managing to combine the good ideas from collect_columns and the good ideas from Tables and may be the ideal implementation: in case my implementation here has performance issues we could try using that one instead (I also think it's slightly suboptimal to have to maintain two separate implementations of the same thing). It would allow several other optimizations as well like iterating lazy rows (see here) rather than materializing NamedTuples like we currently do.

On a related note, I also seem to remember that some functions had yet to be ported to the new iteration framework (join for example) and still rely on inference, so we probably need to keep that in mind.

v = [@NT(a = 1, b = 2), @NT(a = 1, b = 3)]
@test collect_columns(v) == Columns(@NT(a = Int[1, 1], b = Int[2, 3]))
v = [(a = 1, b = 2), (a = 1, b = 3)]
@test collect_columns(v) == Columns((a = Int[1, 1], b = Int[2, 3]))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd be nice to have a constructor for Columns using keyword args, so a level of parens can be dropped.

@test collect_columns(v) == Columns(a = Int[1, 1], b = Int[2, 3])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is such a constructor; maybe this is just checking that passing a (named)tuple also works.

@davidanthoff
Copy link
Contributor

I suspect there may be issues in the case of multiple columns allowing missing data, with the combinatorial explosion of possible types, but I don't have benchmarks.

Yes, that is the scenario I would worry about. Doesn't seem like a corner case to me :)

I think in general a good strategy would be to get this working on julia 0.7, with the minimal set of change, and then think about broader redesigns. No need to couple those two decisions and thereby delay everything by a lot.

@quinnj
Copy link
Member

quinnj commented Sep 11, 2018

Just pushed a commit that gets this further for 0.7/1.0, some remaining test failures involve:

  • Some complaints about isnull not being defined: @davidanthoff is this a DataValues thing/change?
  • Some complaints about no method for setindex_shape_check(::Bool, ::Int64); I'm guessing that was an internal Base method that got changed?
  • A bunch of join tests are still checking DataValue missing vs. missing
  • MethodError: no method matching +(::Int64, ::UnitRange{Int64}) from [1] valuenames(::NDSparse{Tuple{Int64,Int64},Tuple{Int64,Int64},Columns{Tuple{Int64,Int64},Tuple{Array{Int64,1},Array{Int64,1}}},Columns{Tuple{Int64,Int64},Tuple{Array{Int64,1},Array{Int64,1}}}}) at /Users/jacobquinn/.julia/dev/IndexedTables/src/ndsparse.jl:271
  • And a PooledArray error like MethodError: no method matching copy!(::Array{String,1}, ::Int64, ::PooledArray{String,UInt8,1,Array{UInt8,1}}, ::Int64, ::Int64).

Can anybody else pick things up from here? If not, I can try to take another crack or two over the next few days.

@piever
Copy link
Collaborator

piever commented Sep 11, 2018

Some complaints about isnull not being defined: @davidanthoff is this a DataValues thing/change?

This is also affecting StatPlots, FWIU it is isna now.

@JeffBezanson
Copy link
Contributor

The + should be .+ and copy! should be copyto!.

@davidanthoff
Copy link
Contributor

Yes, isna is the new isnull on 0.7+ for DataValue.

I haven’t gotten the broadcasting stuff to work for DataValues.jl on 0.7 yet. Not sure that is used anywhere here, though, just a heads up.

@shashi
Copy link
Collaborator Author

shashi commented Sep 12, 2018

BTW I did a simple map with tables of n missable/DataValue columns. It does scale exponentially with n if you're using Missing...

Nice work @quinnj ! I'll try to fix up the tests tomorrow (they still use missing and there are a couple of failures related to reflection). You can continue to fix other things...

@piever
Copy link
Collaborator

piever commented Sep 12, 2018

@quinnj did you happen to test the scaling of your buildcolumns implementation in case of unknown schema? If you have better performance we could try to port IndexedTables to use it, I suspect materializing the NamedTuples is bad for performance with several missable columns.

@davidanthoff
Copy link
Contributor

I wouldn’t expect unknown Schemas to be the problem, I think iterators that produce streams of values with heterogeneous types are the culprit (and those you easily get in a projection with Missing, but not so easily with DataValue). Those are really distinct.

@JeffBezanson
Copy link
Contributor

Got tests passing locally.

@JeffBezanson
Copy link
Contributor

OK, this fails on 0.7 due to the conflicting export of select, but should pass on 1.0 once PooledArrays is updated.

@JeffBezanson JeffBezanson force-pushed the s/0.7 branch 2 times, most recently from 443f2e3 to 890e303 Compare September 19, 2018 06:55
@JeffBezanson
Copy link
Contributor

Now with more TableTraits.

@@ -431,23 +427,28 @@ end

struct ApplyColwise{T}
functions::T
names::Vector{Symbol}
names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change?

@JeffBezanson
Copy link
Contributor

Brought back the disabled tests. This should be pretty good now.

@quinnj
Copy link
Member

quinnj commented Sep 20, 2018

Merge?

@JeffBezanson JeffBezanson changed the title WIP port to Julia 0.7 port to Julia 0.7 Sep 24, 2018
@JeffBezanson JeffBezanson merged commit 4e0d676 into master Sep 24, 2018
@shashi shashi deleted the s/0.7 branch December 24, 2018 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants