Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

findfirst/findlast/nextind/prevind not working for eachcol in v0.21.0 #2229

Closed
rasmushenningsson opened this issue May 6, 2020 · 9 comments · Fixed by #2291
Closed

findfirst/findlast/nextind/prevind not working for eachcol in v0.21.0 #2229

rasmushenningsson opened this issue May 6, 2020 · 9 comments · Fixed by #2291
Labels
bug feature non-breaking The proposed change is not breaking priority
Milestone

Comments

@rasmushenningsson
Copy link

This code works in v0.20.2,

df = DataFrame(x=["a","b","c"], y=1:3)
findfirst(isequal(1:3),eachcol(df))

but errors in v0.21.0 with the error message.

ERROR: MethodError: no method matching nextind(::DataFrames.DataFrameColumns{DataFrame}, ::Symbol)

The cause seems to be that the underlying indices are now Symbols and not Ints. I didn't see this in the list of breaking changes so I figured it might be unintentional.
(A note, findall() works, but returns Symbols instead of Ints as before.)

Thanks for the new release and all the hard work!

@bkamins
Copy link
Member

bkamins commented May 6, 2020

This was intentional to make the following work consistently:

julia> keys(eachcol(df))
2-element Array{Symbol,1}:
 :x
 :y

julia> pairs(eachcol(df))
Iterators.Pairs(::DataFrames.DataFrameColumns{DataFrame}, ::Array{Symbol,1}) with 2 entries:
  :x => ["a", "b", "c"]
  :y => [1, 2, 3]

This was consider more useful than:

julia> x = rand(3)
3-element Array{Float64,1}:
 0.5557510873245723
 0.43710797460962514
 0.42471785049513144

julia> keys(x)
3-element LinearIndices{1,Tuple{Base.OneTo{Int64}}}:
 1
 2
 3

julia> pairs(x)
pairs(::Array{Float64,1}) with 3 entries:
  1 => 0.555751
  2 => 0.437108
  3 => 0.424718

but indeed it has a drawback you have highlighted.

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

@bkamins bkamins added the feature label May 6, 2020
@bkamins bkamins added this to the 1.x milestone May 6, 2020
@bkamins bkamins added the non-breaking The proposed change is not breaking label May 6, 2020
@bkamins
Copy link
Member

bkamins commented May 6, 2020

(of course breaking these functions was not intentional and I have forgotten to mention this in release notes as there were 102 PRs to annotate so it slipped through)

@rasmushenningsson
Copy link
Author

I think the solution is to add custom implementations of the functions you mention. Would that work for you?

Absolutely. I don't think it's a common use case, but nice to have once in a while.

@bkamins
Copy link
Member

bkamins commented May 29, 2020

We should also fix:

julia> df = DataFrame(a=1,b=2)
1×2 DataFrame
│ Row │ a     │ b     │
│     │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1   │ 1     │ 2     │

julia> hash(eachcol(df))
ERROR: ArgumentError: invalid index: :b of type Symbol

cf. JuliaLang/julia#36073

CC @nalimilan

@bkamins bkamins modified the milestones: 1.x, 1.0 May 29, 2020
@bkamins
Copy link
Member

bkamins commented May 30, 2020

@nalimilan - I have investigated this issue. There is no way to cleanly resolve it. We have the following options:

  1. stop making DataFrameColumns a subtype of AbstractVector (this is my preferred choice, the drawback is that it is mildly breaking; the good thing is that eachcol in Base does not return AbstractVector); then we just should add the methods we believe are worth to keep being supported by this type
  2. make keys(::DataFrameColumns) produce integers - this would cleanly resolve everything, but I think it would not be very useful in practice; so I do not like it
  3. leave things as is and just keep adding methods that "patch" the problematic functions (for now it seems to be hash and nextind) -> this is not a very good approach as we will be never sure in which case something breaks

In summary - I would go for option 1 (stop making DataFrameColumns a subtype of AbstractVector).

@nalimilan
Copy link
Member

Maybe we can decide on a more general solution based on how JuliaLang/julia#36073 is resolved? If keys has to return a vector of linear or cartesian indices for all AbstractVectors, then maybe we can add another function (in DataAPI?) to get less efficient but more informative names, which would be useful for DataFramesColumns, Tables.AbstractColumns but also named arrays types like NamedArray, AxisArray and the newer packages. Or if keys can return any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

Related issues: JuliaArrays/AxisArrays.jl#152 (comment), JuliaArrays/AxisArrays.jl#81, invenia/NamedDims.jl#86.

@bkamins
Copy link
Member

bkamins commented May 30, 2020

Sure - we can wait.

any object as long as it can be used to index into LinearIndices(dfc) and CartesianIndices(dfc), we can try something more complex.

I understand that then you would have to provide a custom type that would be returned by LinearIndices(dfc) etc. - which seems a bit problematic, but maybe this is the way to go?

@bkamins
Copy link
Member

bkamins commented Jun 11, 2020

@nalimilan - given the comment in JuliaLang/julia#36073 I think we should remove AbstractVector subtype from DataFrameColumns and just add custom methods that normally work for vectors as needed. If there is no objection I will implement this.

@rasmushenningsson - I guess it should be OK with you - right?

@bkamins
Copy link
Member

bkamins commented Jun 16, 2020

Please have a look at #2291 for the implementation. Actually maybe I even like it more as indexing with multiple columns to DataFrameColumns now produces DataFrameColumns which seems more consistent.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug feature non-breaking The proposed change is not breaking priority
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants