Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataFrames 0.11+ compat via IndirectArrays #1090

Merged
merged 18 commits into from
Mar 30, 2018

Conversation

andreasnoack
Copy link
Contributor

@tlnagy Here is my version of the DataFrame 0.11+ compatible version. This version passes locally (see caveat below) with this version of IndirectArrays. I decided that the wrapping behavior of CategoricalArrays was too inconvenient to get working with the current setup because it is implicitly assumed in most functions that "categorical" arrays return the same kand of values as non-categorical arrays. Maybe by restructuring the code to rely more on dispatch to smaller specialized methods it might be possible to use CategoricalArrays internally, but I'm not sure it is worth it. Basically, I don't think it is useful to have the pool handy for the use cases here in Gadfly.

Although the tests pass, it requires that Julia runs with --compilecache=no. Otherwise, it triggers what appears to be completely spurious ambiguity errors that we'd need @vtjnash or @JeffBezanson to comment on. It is hard for me to narrow down the issue but it could be related to module/using/include. E.g. I can trigger/detrigger the issue by including a file where all lines are commented. I've also been able to trigger/detrigger it with a foo() = 1 definition in the middle of the statistics.jl file.

src/Gadfly.jl Outdated
@@ -1121,7 +1123,7 @@ end

import Juno: Juno, @render, media, Media, Hiccup

media(Plot, Media.Plot)
# media(Plot, Media.Plot)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break Juno's plot rendering.

@andreasnoack
Copy link
Contributor Author

So the issue mentioned above is fixed by JuliaLang/julia#25832. I've tried to cherry-pick the fix on release-0.6 branch and it works. This means that getting Gadfly working with the new Missing based data stack requires a new 0.6.x release which currently is not on the roadmap because of resource constraints. Hence there is a risk that we'll only get updated Gadfly supported after 0.7/1.0 is out. As a consequence, I've started the process of getting this branch working on Julia master.

@tlnagy
Copy link
Member

tlnagy commented Feb 1, 2018

Hence there is a risk that we'll only get updated Gadfly supported after 0.7/1.0 is out.

I think we're getting pretty close to a 0.7 release, no? I think that at this point it's been several months since the new DataFrames came out and the damage is likely done (people either pinned DataFrames or dropped Gadfly) so I wouldn't mind waiting till 0.7 came out.

As a consequence, I've started the process of getting this branch working on Julia master.

Awesome!

@bjarthur
Copy link
Member

bjarthur commented Feb 2, 2018

@andreasnoack could you add --compilecache=no to this PR in travis.yml? if it passes with that i'd suggest we merge. could also add a temporary warning to the README/docs about needing to run with that flag if you use DataFrames.

@andreasnoack
Copy link
Contributor Author

I'd need to finish up JuliaArrays/IndirectArrays.jl#12 first but then I might try to add the flag here.

@bjarthur
Copy link
Member

bjarthur commented Feb 4, 2018

i tried checking out this PR, the PR in IndirectArrays, and using --compilecache=no. works great except test/testscripts/boxplot.jl. throws a DimensionMismatch error. can you confirm?

@andreasnoack
Copy link
Contributor Author

I'm able to reproduce that error. I thought I had them all working. Will take a look.

@andreasnoack
Copy link
Contributor Author

andreasnoack commented Feb 6, 2018

Oh. Now I know what is going on. I had added what I thought was a harmless deprecation fix. Should work now.

Btw. There'll be some issues with the testing when we transition to 0.7 because 0.7 doesn't produce the same random Int64s as 0.7 for the same seeds. Hence the uuids embedded in the produces images will be different so we'd need a way to handle that in the regression testing.

@bjarthur
Copy link
Member

bjarthur commented Feb 7, 2018

can we use a different seed to reproduce the old behavior?

@andreasnoack
Copy link
Contributor Author

Unfortunately, I don't think that is possible. See also JuliaLang/julia#25058 (comment)

@kleinschmidt
Copy link
Contributor

kleinschmidt commented Feb 8, 2018

It looks like there's an issue somewhere here with levels/sorting. I'm getting a mismatch between group labels and the actual data plotted using Geom.subplot_grid.

current master (which is correct):

screenshot from 2018-02-07 20-37-19

this PR (with @andreasnoack PR on IndirectArrays.jl as well):

screenshot from 2018-02-07 20-37-41

@nalimilan
Copy link
Contributor

It looks like there's an issue somewhere here with levels/sorting. I'm getting a mismatch between group labels and the actual data plotted using Geom.subplot_grid.

I don't really understand what's going on in the figure you showed, but if that's related to the order of levels then it might be due to the fact that IndirectArray does not support a custom ordering of levels? I've just realized that it could be a significant problem: with this PR, is the custom ordering of levels preserved when plotting for CategoricalArray variables? That's one of the main advantages of this type, and PooledDataArray supported it.

src/scale.jl Outdated
discretize_make_pda(values::CategoricalArray, ::Void) = discretize_make_pda(values)
function discretize_make_pda(values::CategoricalArray{T}, levels::Vector) where {T}
index = map!(t -> ismissing(t) ?
findfirst(ismissing, levels) :
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really weird indentation here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is the first four space aligned position after the (. Arguments are aligned to the ( but this splits a ternary instead of two arguments so it should be indented. The call is not super easy to read as is but the indentation makes it easier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT there are only three spaces, not four (assuming the count starts after (, i.e. starts with the t). More importantly, what bugs me is that the second operand isn't aligned with the first one.

Copy link
Contributor

@nalimilan nalimilan Mar 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, and more importantly, why is get needed here? Shouldn't be AFAICT. That's also going to be very very slow. Better do something like:

mapping = coalesce.(indexin(CategoricalArrays.index(values.pool), levels), 0)
pushfirst!(mapping, coalesce(findfirst(ismissing, levels), 0))
mapping .+= 1
index = [mapping[x+1] for x in values.refs]

EDIT: I'm not sure what should happen for values which are not in levels: throw an error?

The same code should also work when levels is a CategoricalVector (i.e. no need for the special method below).

Could also drop the "pda" terminology.

@andreasnoack
Copy link
Contributor Author

I'll have to take a closer look. Do you have a minimal example to reproduce this? My guess is that I'd have to be more careful when converting a CategoricalVector to an indirect Vector such that the values end up in the right order.

@nalimilan
Copy link
Contributor

Sorry, I don't use Gadfly, but I guess something like this should do if you set a custom non-alphabetical order for countries (e.g. grouped by regions). Note that it may already be working, I just happened to think about it when reading the description of the issue above.

@kleinschmidt
Copy link
Contributor

@nalimilan the problem with the figure I showed is that the labels on the facets (\alpha=0.01, 0.1, 1, 10) are wrong. with the PR they've been circshifted by one, while the facets are still plotted in the same order (just the labels have changed). The values that are used to group the facets are a data frame column that's an Array{String}.

I will look into this more later and figure out a minimal example but don't have time right at the moment.

@bjarthur
Copy link
Member

i merged a PR which requires some conflicts to be resolved with this one. sorry, but they are small. good thing is that regression testing is even easier now!

@andreasnoack
Copy link
Contributor Author

@kleinschmidt I've pushed a possible fix for this. Would be great if you could try it out.

@bjarthur No problem. I've rebased.

@bjarthur
Copy link
Member

before we merge this, would it make sense to alter REQUIRE to put a ceiling on DataFrames of 0.11.4? if so, i'd say let's do that ASAP.

also, why were breaking changes in DataFrames made on a point release? shouldn't they have bumped the minor version?

@andreasnoack
Copy link
Contributor Author

would it make sense to alter REQUIRE to put a ceiling on DataFrames of 0.11.4?

I'm not sure I follow. Why would you put a ceiling there? We could try to put a ceiling at the next breaking changing version but DataFrames doesn't follow SEMVER (yet) so it for to know if it 0.14, 0.17, or 1.0.

The breaking change in DataFrames happened at 0.11.0 but there have been some bug fixes after that. I think I decided to use 0.11.4 because it has a fix for the deprecated readtable function.

I'll try to finish the IndirectArrays PR soon such that we can have this merged.

@bjarthur
Copy link
Member

the motivation to upperbound DataFrames now is to fix the master branch of Gadfly now and give you time to finish without hurrying. i assumed the breaking change was 0.11.4 since that's what this PR specifies in its REQUIRE.

@andreasnoack
Copy link
Contributor Author

The previous test error was because of an improvement in RData. Hopefully, tests will now pass.

REQUIRE Outdated
@@ -16,3 +16,10 @@ Loess
Showoff 0.0.3
StatsBase
Juno
IndirectArrays 0.4.1
Missings
# Gadfly doesn't use WeakRefString directly but because of a but in Julia 0.6.2, Gadfly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bug not but

@tlnagy
Copy link
Member

tlnagy commented Mar 29, 2018

Old behavior:

a = DataFrame(diff = PooledDataArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram)

New behavior:

a = DataFrame(diff = CategoricalArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram)

I think the new behavior makes sense with the transition from PooledDataArrays to CategoricalArrays. The latter is what I would expect for categorical data. And you can get the old behavior back if you just use an Array

a = DataFrame(diff = Float64[1,2,3,3,3,4,3,2])
plot(a, x="diff", Geom.histogram)

@tlnagy
Copy link
Member

tlnagy commented Mar 29, 2018

I'm okay with merging since tests are passing and there aren't any regressions. @bjarthur, thoughts?

@@ -173,6 +174,9 @@ function apply_scale(scale::ContinuousScale,
for (aes, data) in zip(aess, datas)
for var in scale.vars
vals = getfield(data, var)
if vals isa CategoricalArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a strong reason to throw an error here? I fully agree that by default CategoricalArray should be treated as categorical even if it contains numeric values (that was the rationale for moving from PooledDataArray), but if the user explicitly requests a continuous scale, why not accept it? Or is it because the code fails later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a strong reason to throw an error here?

Yes. It won't work. The CategorialcalValues will cause failures sooner or later so I thought an informative error would be more helpful. The specific error in the example I tried is a missing isfinite(::CategoricalValue). As mentioned previously, the core of this PR is to avoid that CategoricalValues escape to Compose. In the discrete case, all data vectors go through discretize where CategoricalArrays are converted to IndirectArrays to avoid CategoricalValues but in the continuous case there isn't a similar function we can use for making sure all CategoricalArrays are converted.

that was the rationale for moving from PooledDataArray

I don't really understand why it wouldn't be exactly the same thing for PooledDataArrays. I'd also expect them to be treated as categorical.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. That can always be improved later anyway if somebody needs it. Thanks for finishing this porting work!

I don't really understand why it wouldn't be exactly the same thing for PooledDataArrays. I'd also expect them to be treated as categorical.

Yeah, but there had always been an ambiguity, as PDAs were also used as an efficient storage type (e.g. they supported math operations). Now the distinction should be clearer, if only because of the CategoricalArray name.

@bjarthur
Copy link
Member

subplot_grid and subplot_grid_free_axis still show regression differences for me.

@andreasnoack
Copy link
Contributor Author

subplot_grid and subplot_grid_free_axis still show regression differences for me.

I get the plots below. Indeed the years have flipped but I think the new version is the correct one.

julia> barley = dataset("lattice", "barley");

julia> levels!(barley[:Year], ["1931", "1932"]);

julia> barley[(barley[:Year] .== "1931") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield │ Variety │ Year │ Site   │
├─────┼───────┼─────────┼──────┼────────┤
│ 122.6  │ No. 4751931 │ Morris │

julia> barley[(barley[:Year] .== "1932") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield   │ Variety │ Year │ Site   │
├─────┼─────────┼─────────┼──────┼────────┤
│ 144.2333 │ No. 4751932 │ Morris │

cached

subplot_grid

genned

subplot_grid

@bjarthur
Copy link
Member

agreed! i've tagged gadfly and compose. as soon as that's merged into METADATA i'll merge this.

@bjarthur bjarthur merged commit c778af5 into GiovineItalia:master Mar 30, 2018
@andreasnoack andreasnoack deleted the anj/indirect branch March 30, 2018 16:06
@bjarthur
Copy link
Member

bjarthur commented Apr 1, 2018

@kleinschmidt @cecileane have you two had a chance to checkout Gadfly master since this PR merged? just want to make sure all is okay before we tag a release. thanks!

@kleinschmidt
Copy link
Contributor

I’ll have a look later today.

@cecileane
Copy link

Thanks @bjarthur! Yes I checked out Gadfly, and I first ran into a bunch of problems (see this issue). I didn't figure out why, but things are now working (on a different computer) after checking out the master branch of Compose 👍 . After you guys publish a new version of Gadfly (and Compose?), I will add Gadfly back into PhyloPlots.

@cecileane
Copy link

well, my earlier comment was using julia v0.6.1. Things don't work with julia v0.6.2. At compilation, I get the same error as mentioned here, with a very cryptic stacktrace:

ERROR: LoadError: MethodError: no method matching convert(::Type{AssertionError}, ::String)
Closest candidates are:
  convert(::Type{Any}, ::ANY) at essentials.jl:28
  convert(::Type{T}, ::T) where T at essentials.jl:29

and final error:

MethodError(Core.Inference.convert, (AssertionError, "invalid age range update"), 0x0000000000000ac6)

This is using the master version of Gadfly and Compose, and adding back "using Gadfly" in PhyloPlots.jl. Not even including the file that uses Gadfly functions. Any idea welcome!

@andreasnoack
Copy link
Contributor Author

@cecileane Thanks for trying to test this. I think you might be hitting JuliaLang/julia#21653. It shouldn't be related directly to the recent changes in Gadfly.

@kleinschmidt
Copy link
Contributor

All is well for my use case! Thanks for working so hard on this everyone!

@cecileane
Copy link

yes indeed, thanks @andreasnoack. It looks like the same kind of error that @mauro3 showed, and it does seem sensitive to using lots of modules together.

@tlnagy
Copy link
Member

tlnagy commented Apr 4, 2018

Is DataArrays still needed in REQUIRE?

DataArrays

@bjarthur
Copy link
Member

bjarthur commented Apr 6, 2018

so just to clarify, with this now merged into master and IndirectArrays modified, does one need to use the --compilecache=no flag when using Gadfly?

@andreasnoack
Copy link
Contributor Author

andreasnoack commented Apr 6, 2018

does one need to use the --compilecache=no flag when using Gadfly?

No. Things should work and the flag is not present in .travis.yml.

The only regression that I'm aware of is #1130. I think I know how to fix it but I'm off for a week of vacation on Sunday so might not get around fixing it before that. It is mainly a matter of changing the part of

Gadfly.jl/src/misc.jl

Lines 408 to 414 in 12327dd

function discretize_make_ia(values::CategoricalArray{T}, levels::Vector) where {T}
mapping = coalesce.(indexin(CategoricalArrays.index(values.pool), levels), 0)
unshift!(mapping, coalesce(findfirst(ismissing, levels), 0))
index = [mapping[x+1] for x in values.refs]
any(iszero, index) && throw(ArgumentError("values not in levels encountered"))
return IndirectArray(index, convert(Vector{T},levels))
end
that throws and instead convert the zeros in index to n+1 and add missing to levels. However,

Gadfly.jl/src/misc.jl

Lines 397 to 398 in 12327dd

discretize_make_ia(values::AbstractVector, levels) =
IndirectArray(Array{UInt8}(indexin(values, levels)), levels)
should probably also be changed similarly in case the input is e.g. a Vector{String}.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants