Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

describe fails on a categorical array column from RDatasets dataframe #2672

Closed
pgagarinov opened this issue Mar 23, 2021 · 7 comments · Fixed by JuliaData/CategoricalArrays.jl#342
Labels

Comments

@pgagarinov
Copy link

pgagarinov commented Mar 23, 2021

How to reproduce

using RDatasets
school = RDatasets.dataset("mlmRev","Hsb82")
describe(school)
cannot add new level No since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

Stacktrace:
  [1] merge_pools!(A::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}}, B::CategoricalArrays.CategoricalValue{String, UInt8}; updaterefs::Bool, updatepool::Bool)
    @ CategoricalArrays ~/.julia/packages/CategoricalArrays/MYrys/src/array.jl:468
  [2] merge_pools!
    @ ~/.julia/packages/CategoricalArrays/MYrys/src/array.jl:465 [inlined]
  [3] setindex!
    @ ~/.julia/packages/CategoricalArrays/MYrys/src/array.jl:496 [inlined]
  [4] collect_to!(dest::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}}, itr::Base.Generator{Vector{Dict{Symbol, Any}}, DataFrames.var"#68#75"{Symbol}}, offs::Int64, st::Int64)
    @ Base ./array.jl:728
  [5] collect_to_with_first!(dest::CategoricalArrays.CategoricalVector{String, UInt8, String, CategoricalArrays.CategoricalValue{String, UInt8}, Union{}}, v1::CategoricalArrays.CategoricalValue{String, UInt8}, itr::Base.Generator{Vector{Dict{Symbol, Any}}, DataFrames.var"#68#75"{Symbol}}, st::Int64)
    @ Base ./array.jl:702
  [6] collect(itr::Base.Generator{Vector{Dict{Symbol, Any}}, DataFrames.var"#68#75"{Symbol}})
    @ Base ./array.jl:683
  [7] _describe(df::DataFrame, stats::Vector{Symbol})
    @ DataFrames ~/.julia/packages/DataFrames/oQ5c7/src/abstractdataframe/abstractdataframe.jl:639
  [8] describe(df::DataFrame; cols::Function)
    @ DataFrames ~/.julia/packages/DataFrames/oQ5c7/src/abstractdataframe/abstractdataframe.jl:569
  [9] describe(df::DataFrame)
    @ DataFrames ~/.julia/packages/DataFrames/oQ5c7/src/abstractdataframe/abstractdataframe.jl:569
 [10] top-level scope
    @ In[45]:1
 [11] eval
    @ ./boot.jl:360 [inlined]
 [12] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base ./loading.jl:1094

Versions

Julia v"1.6.0-rc3"

      Status `~/.../Project.toml`
  [336ed68f] CSV v0.8.4
  [324d7699] CategoricalArrays v0.9.3
  [a93c6f00] DataFrames v0.22.5
  [6deec6e2] IndexedTables v1.0.0
  [1a8c2f83] Query v1.0.0
  [ce6b1742] RDatasets v0.7.4
  [f3b207a7] StatsPlots v0.14.19
  [3eeacb1d] TabularDisplay v1.2.0
@pgagarinov pgagarinov changed the title [BUG] describe fails on categorical array column [BUG] describe fails on a categorical array column from RDatasets Mar 23, 2021
@pgagarinov pgagarinov changed the title [BUG] describe fails on a categorical array column from RDatasets [BUG] describe fails on a categorical array column from RDatasets dataframe Mar 23, 2021
@pgagarinov pgagarinov changed the title [BUG] describe fails on a categorical array column from RDatasets dataframe [BUG] "describe" fails on a categorical array column from RDatasets dataframe Mar 23, 2021
@bkamins bkamins changed the title [BUG] "describe" fails on a categorical array column from RDatasets dataframe describe fails on a categorical array column from RDatasets dataframe Mar 23, 2021
@bkamins bkamins added the bug label Mar 23, 2021
@bkamins bkamins added this to the 1.0 milestone Mar 23, 2021
@bkamins
Copy link
Member

bkamins commented Mar 23, 2021

@nalimilan - the problem is this line: https://github.com/JuliaData/DataFrames.jl/blob/main/src/abstractdataframe/abstractdataframe.jl#L643
where due to the fact that comprehension sees an ordered categorical value as a first element the rest fails. This is related to automatic creation of CategoricalArray by comprehensions. I have opened JuliaData/CategoricalArrays.jl#333 to track it.

What do you think we should do?

@nalimilan
Copy link
Member

I've commented in CategoricalArrays. I'm afraid there's not much we can do on the DataFrames side. describe is a bit peculiar since it mixes values from different columns in a single column. We could create Vector{Any} columns, but that would be suboptimal for cases where all columns have the same type.

@bkamins
Copy link
Member

bkamins commented Mar 23, 2021

For the time being we could do:

v = [d[stat] for d in col_stats_dicts]
try
    v = [x for x in v]
catch
end

In this way we will narrow down the eltype if possible and otherwise leave it Any. It should be a big performance hit.

@nalimilan
Copy link
Member

Yeah that would be OK. Though it's kind of ugly...

@bkamins
Copy link
Member

bkamins commented Mar 24, 2021

If you agree on what I propose in JuliaData/CategoricalArrays.jl#333 (i.e. to make sure that this comprehension does not throw an error then I would not change anything in DataFrames.jl).

We have to fix it in general I think as otherwise e.g. the following also fails:

julia> df = DataFrame(a=categorical([1,2,3], ordered=true), b=categorical([4, 5, 6], ordered=true))
3×2 DataFrame
 Row │ a     b
     │ Cat…  Cat…
─────┼────────────
   1 │ 1     4
   2 │ 2     5
   3 │ 3     6

julia> select(df, [:a, :b] => ByRow((x,y) -> x == 1 ? x : y))
ERROR: cannot add new level 4 since ordered pools cannot be extended implicitly. Use the levels! function to set new levels, or the ordered! function to mark the pool as unordered.

and we cannot fix such things manually everywhere in DataFrames.jl.

@bkamins
Copy link
Member

bkamins commented Mar 25, 2021

I would close this and fix the issue in CategoricalArrays.jl

@bkamins
Copy link
Member

bkamins commented Mar 30, 2021

bump - @nalimilan do you think we can close this PR here and delegate to CategoricalArrays.jl to fix it there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants