-
Notifications
You must be signed in to change notification settings - Fork 250
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DataFrames 0.11+ compat via IndirectArrays #1090
Conversation
ea27bfd
to
ad42de7
Compare
src/Gadfly.jl
Outdated
@@ -1121,7 +1123,7 @@ end | |||
|
|||
import Juno: Juno, @render, media, Media, Hiccup | |||
|
|||
media(Plot, Media.Plot) | |||
# media(Plot, Media.Plot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will break Juno's plot rendering.
ad42de7
to
1980e9c
Compare
So the issue mentioned above is fixed by JuliaLang/julia#25832. I've tried to cherry-pick the fix on |
I think we're getting pretty close to a 0.7 release, no? I think that at this point it's been several months since the new DataFrames came out and the damage is likely done (people either pinned DataFrames or dropped Gadfly) so I wouldn't mind waiting till 0.7 came out.
Awesome! |
@andreasnoack could you add |
I'd need to finish up JuliaArrays/IndirectArrays.jl#12 first but then I might try to add the flag here. |
i tried checking out this PR, the PR in IndirectArrays, and using |
I'm able to reproduce that error. I thought I had them all working. Will take a look. |
1980e9c
to
bc815d1
Compare
Oh. Now I know what is going on. I had added what I thought was a harmless deprecation fix. Should work now. Btw. There'll be some issues with the testing when we transition to 0.7 because 0.7 doesn't produce the same random |
can we use a different seed to reproduce the old behavior? |
Unfortunately, I don't think that is possible. See also JuliaLang/julia#25058 (comment) |
It looks like there's an issue somewhere here with levels/sorting. I'm getting a mismatch between group labels and the actual data plotted using current master (which is correct): this PR (with @andreasnoack PR on IndirectArrays.jl as well): |
I don't really understand what's going on in the figure you showed, but if that's related to the order of levels then it might be due to the fact that |
src/scale.jl
Outdated
discretize_make_pda(values::CategoricalArray, ::Void) = discretize_make_pda(values) | ||
function discretize_make_pda(values::CategoricalArray{T}, levels::Vector) where {T} | ||
index = map!(t -> ismissing(t) ? | ||
findfirst(ismissing, levels) : |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really weird indentation here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is the first four space aligned position after the (
. Arguments are aligned to the (
but this splits a ternary instead of two arguments so it should be indented. The call is not super easy to read as is but the indentation makes it easier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT there are only three spaces, not four (assuming the count starts after (
, i.e. starts with the t
). More importantly, what bugs me is that the second operand isn't aligned with the first one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, and more importantly, why is get
needed here? Shouldn't be AFAICT. That's also going to be very very slow. Better do something like:
mapping = coalesce.(indexin(CategoricalArrays.index(values.pool), levels), 0)
pushfirst!(mapping, coalesce(findfirst(ismissing, levels), 0))
mapping .+= 1
index = [mapping[x+1] for x in values.refs]
EDIT: I'm not sure what should happen for values which are not in levels: throw an error?
The same code should also work when levels
is a CategoricalVector
(i.e. no need for the special method below).
Could also drop the "pda" terminology.
I'll have to take a closer look. Do you have a minimal example to reproduce this? My guess is that I'd have to be more careful when converting a CategoricalVector to an indirect Vector such that the |
Sorry, I don't use Gadfly, but I guess something like this should do if you set a custom non-alphabetical order for countries (e.g. grouped by regions). Note that it may already be working, I just happened to think about it when reading the description of the issue above. |
@nalimilan the problem with the figure I showed is that the labels on the facets (\alpha=0.01, 0.1, 1, 10) are wrong. with the PR they've been circshifted by one, while the facets are still plotted in the same order (just the labels have changed). The values that are used to group the facets are a data frame column that's an I will look into this more later and figure out a minimal example but don't have time right at the moment. |
i merged a PR which requires some conflicts to be resolved with this one. sorry, but they are small. good thing is that regression testing is even easier now! |
bc815d1
to
a10501b
Compare
@kleinschmidt I've pushed a possible fix for this. Would be great if you could try it out. @bjarthur No problem. I've rebased. |
before we merge this, would it make sense to alter REQUIRE to put a ceiling on DataFrames of 0.11.4? if so, i'd say let's do that ASAP. also, why were breaking changes in DataFrames made on a point release? shouldn't they have bumped the minor version? |
I'm not sure I follow. Why would you put a ceiling there? We could try to put a ceiling at the next breaking changing version but DataFrames doesn't follow SEMVER (yet) so it for to know if it 0.14, 0.17, or 1.0. The breaking change in I'll try to finish the IndirectArrays PR soon such that we can have this merged. |
the motivation to upperbound DataFrames now is to fix the master branch of Gadfly now and give you time to finish without hurrying. i assumed the breaking change was 0.11.4 since that's what this PR specifies in its REQUIRE. |
CategoricalArrays
905db90
to
98f97ac
Compare
The previous test error was because of an improvement in |
REQUIRE
Outdated
@@ -16,3 +16,10 @@ Loess | |||
Showoff 0.0.3 | |||
StatsBase | |||
Juno | |||
IndirectArrays 0.4.1 | |||
Missings | |||
# Gadfly doesn't use WeakRefString directly but because of a but in Julia 0.6.2, Gadfly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bug not but
Old behavior: a = DataFrame(diff = PooledDataArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram) New behavior: a = DataFrame(diff = CategoricalArray(Float64[1,2,3,3,3,4,3,2]))
plot(a, x="diff", Geom.histogram) I think the new behavior makes sense with the transition from a = DataFrame(diff = Float64[1,2,3,3,3,4,3,2])
plot(a, x="diff", Geom.histogram) |
I'm okay with merging since tests are passing and there aren't any regressions. @bjarthur, thoughts? |
@@ -173,6 +174,9 @@ function apply_scale(scale::ContinuousScale, | |||
for (aes, data) in zip(aess, datas) | |||
for var in scale.vars | |||
vals = getfield(data, var) | |||
if vals isa CategoricalArray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a strong reason to throw an error here? I fully agree that by default CategoricalArray
should be treated as categorical even if it contains numeric values (that was the rationale for moving from PooledDataArray
), but if the user explicitly requests a continuous scale, why not accept it? Or is it because the code fails later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a strong reason to throw an error here?
Yes. It won't work. The CategorialcalValue
s will cause failures sooner or later so I thought an informative error would be more helpful. The specific error in the example I tried is a missing isfinite(::CategoricalValue)
. As mentioned previously, the core of this PR is to avoid that CategoricalValue
s escape to Compose
. In the discrete case, all data vectors go through discretize
where CategoricalArray
s are converted to IndirectArrays
to avoid CategoricalValue
s but in the continuous case there isn't a similar function we can use for making sure all CategoricalArray
s are converted.
that was the rationale for moving from
PooledDataArray
I don't really understand why it wouldn't be exactly the same thing for PooledDataArray
s. I'd also expect them to be treated as categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. That can always be improved later anyway if somebody needs it. Thanks for finishing this porting work!
I don't really understand why it wouldn't be exactly the same thing for PooledDataArrays. I'd also expect them to be treated as categorical.
Yeah, but there had always been an ambiguity, as PDAs were also used as an efficient storage type (e.g. they supported math operations). Now the distinction should be clearer, if only because of the CategoricalArray
name.
|
I get the plots below. Indeed the years have flipped but I think the new version is the correct one. julia> barley = dataset("lattice", "barley");
julia> levels!(barley[:Year], ["1931", "1932"]);
julia> barley[(barley[:Year] .== "1931") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield │ Variety │ Year │ Site │
├─────┼───────┼─────────┼──────┼────────┤
│ 1 │ 22.6 │ No. 475 │ 1931 │ Morris │
julia> barley[(barley[:Year] .== "1932") .& (barley[:Site] .== "Morris") .& (barley[:Variety] .== "No. 475"),:]
1×4 DataFrames.DataFrame
│ Row │ Yield │ Variety │ Year │ Site │
├─────┼─────────┼─────────┼──────┼────────┤
│ 1 │ 44.2333 │ No. 475 │ 1932 │ Morris │ cachedgenned |
agreed! i've tagged gadfly and compose. as soon as that's merged into METADATA i'll merge this. |
@kleinschmidt @cecileane have you two had a chance to checkout Gadfly master since this PR merged? just want to make sure all is okay before we tag a release. thanks! |
I’ll have a look later today. |
Thanks @bjarthur! Yes I checked out Gadfly, and I first ran into a bunch of problems (see this issue). I didn't figure out why, but things are now working (on a different computer) after checking out the master branch of Compose 👍 . After you guys publish a new version of Gadfly (and Compose?), I will add Gadfly back into PhyloPlots. |
well, my earlier comment was using julia v0.6.1. Things don't work with julia v0.6.2. At compilation, I get the same error as mentioned here, with a very cryptic stacktrace:
and final error:
This is using the master version of Gadfly and Compose, and adding back "using Gadfly" in PhyloPlots.jl. Not even including the file that uses Gadfly functions. Any idea welcome! |
@cecileane Thanks for trying to test this. I think you might be hitting JuliaLang/julia#21653. It shouldn't be related directly to the recent changes in Gadfly. |
All is well for my use case! Thanks for working so hard on this everyone! |
yes indeed, thanks @andreasnoack. It looks like the same kind of error that @mauro3 showed, and it does seem sensitive to using lots of modules together. |
Is Line 8 in e36d55c
|
so just to clarify, with this now merged into master and IndirectArrays modified, does one need to use the |
No. Things should work and the flag is not present in The only regression that I'm aware of is #1130. I think I know how to fix it but I'm off for a week of vacation on Sunday so might not get around fixing it before that. It is mainly a matter of changing the part of Lines 408 to 414 in 12327dd
index to n+1 and add missing to levels . However, Lines 397 to 398 in 12327dd
Vector{String} .
|
@tlnagy Here is my version of the DataFrame 0.11+ compatible version. This version passes locally (see caveat below) with this version of IndirectArrays. I decided that the wrapping behavior of CategoricalArrays was too inconvenient to get working with the current setup because it is implicitly assumed in most functions that "categorical" arrays return the same kand of values as non-categorical arrays. Maybe by restructuring the code to rely more on dispatch to smaller specialized methods it might be possible to use CategoricalArrays internally, but I'm not sure it is worth it. Basically, I don't think it is useful to have the
pool
handy for the use cases here in Gadfly.Although the tests pass, it requires that Julia runs with
--compilecache=no
. Otherwise, it triggers what appears to be completely spurious ambiguity errors that we'd need @vtjnash or @JeffBezanson to comment on. It is hard for me to narrow down the issue but it could be related tomodule
/using
/include
. E.g. I can trigger/detrigger the issue by including a file where all lines are commented. I've also been able to trigger/detrigger it with afoo() = 1
definition in the middle of thestatistics.jl
file.