-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update to DataFrames v0.11 (CategoricalArrays + Missings) #28
Conversation
def7de9
to
24fc1bf
Compare
Oops, still a little bit too early. It needs |
3d2dfd5
to
e969026
Compare
The CI pass on v0.6, but not on nightly (some error in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great!
NEWS.md
Outdated
##### Changes | ||
* R logical vectors converted to `Vector{Bool}` (instead of `Vector{Int32}`) | ||
* R factors converted into `CategoricalVector` (instead of `PooledDataArray`) ([#19]) | ||
* switched from DataVector to Vector{Union{T,Null}} for NAs ([#19]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing backticks.
NEWS.md
Outdated
@@ -15,5 +25,6 @@ Initial release based on `DataFrames.read_rda()` ([JuliaStats/DataFrames.jl#1031 | |||
[#9]: https://github.com/JuliaStats/RData.jl/issues/9 | |||
[#10]: https://github.com/JuliaStats/RData.jl/issues/10 | |||
[#15]: https://github.com/JuliaStats/RData.jl/issues/15 | |||
[#19]: https://github.com/JuliaStats/RData.jl/issues/19 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrong place?
REQUIRE
Outdated
DataFrames 0.7 | ||
DataArrays 0.3 | ||
julia 0.6 | ||
DataFrames 0.10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That will be 0.11.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I should keep it 0.10 for now, otherwise the package won't install.
src/context.jl
Outdated
int2ver(v::Integer) = VersionNumber(v >> 16, (v >> 8) & 0xff, v & 0xff) | ||
function RDAContext{T<:RDAIO}(io::T, kwoptions::Vector{Any}=Any[]) | ||
|
||
function RDAContext(io::T, kwoptions::Vector{Any}=Any[]) where {T<:RDAIO} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to define T
.
src/convert.jl
Outdated
# if re or im is NA, the whole complex number is NA | ||
# FIXME avoid temporary Vector{Bool} | ||
namask(rc::RComplexVector) = BitArray([isna_float64(v.re) || isna_float64(v.im) for v in reinterpret(Complex{UInt64}, rc.data)]) | ||
namask(rc::RComplexVector) = [isna_float64(v.re) || isna_float64(v.im) for v in reinterpret(Complex{UInt64}, rc.data)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better use broadcast
with an anonymous function so that the result is a BitArray
. Though I see you switched to Vector{Bool}
in the RNullableVector
struct. Why change that? Also, other namask
methods return a BitArray
, so that change may force an inefficient conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was a legacy from a switch to DataTables and the fact that I forgot that vectorizing logical operation gives BitVector
.
I'll fix namask()
to return BitVector
, but should RLogicalVector
be converted into Vector{Bool}
or BitVector
given that it will most likely be a column in a data frame?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the difference matters a lot, given that the conversion to Array{Union{Null, T}}
won't be able to reuse the mask directly. Array{Bool}
should be faster, but will also use 8× more memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the revision I've removed namask()
, because with Union{T,Null}
we don't actually need NA mask. That was the legacy from DataArray
/NullableArray
times.
src/convert.jl
Outdated
# convert R logical vector (uses Int32 to store values) into [Nullable]Vector{Bool} | ||
function jlvec(rl::RLogicalVector, force_nullable::Bool=true) | ||
na_mask = namask(rl) | ||
vals = rl.data .!= zero(eltype(rl.data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.!= 0
should be enough here.
src/convert.jl
Outdated
return PooledDataArray(DataArrays.RefArray(refs), pool) | ||
# FIXME set ordered flag | ||
refs = na2zero(REFTYPE, ri.data) | ||
anyna = findfirst(refs, zero(REFTYPE)) > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just any(iszero, refs)
src/convert.jl
Outdated
end | ||
|
||
# converts Vector{Int32} into Vector{R} replacing R_NA_INT32 with 0 | ||
# it's assumed that v fits into R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How safe is that assumption?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only caller so far is jlvec()
method below, and it calculates R
to fit all the factor levels.
src/convert.jl
Outdated
# FIXME remove Any type assertion workaround | ||
DataFrame(Any[data(col) for col in rl.data], map(identifier, names(rl))) | ||
# FIXME add forceNullable option to control whether always convert to Union{T, Null} | ||
DataFrame(Any[jlvec(col, true) for col in rl.data], identifier.(names(rl))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataFrames no longer automatically converts columns to nullable, probably better keep them non-nullable when there are no nulls.
src/io/XDRIO.jl
Outdated
@@ -25,7 +25,11 @@ function readfloatorNA(io::XDRIO, n::RVecLength) | |||
reinterpret(Float64, map!(ntoh, v, v)) | |||
end | |||
|
|||
readuint8(io::XDRIO, n::RVecLength) = readbytes(io.sub, n) | |||
function readuint8(io::XDRIO, n::RVecLength) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't read(io, UInt8, n)
be enough?
src/convert.jl
Outdated
na_mask = namask(rv) | ||
anyna = any(na_mask) | ||
if force_nullable || anyna | ||
function jlvec(::Type{T}, rv::RVEC, forceNullable::Bool=true) where T |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
force_nullable
was a more Julian name. Same below.
src/convert.jl
Outdated
res = convert(Vector{Union{T,Null}}, rv.data) | ||
if anyna | ||
@inbounds res[na_mask] = null | ||
for (i,x) in enumerate(rv.data) | ||
isna(x) && @inbounds res[i] = null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think putting @inbounds
before for
will generate more efficient code since it will also apply to enumerate
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably you're right, and I'll fix it.
But it's natural to assume that for x in rv.data
would do @inbounds
access, and so should do for (i,x) in enumerate(rv.data)
. Maybe something could be tweaked in the Base
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's definitely how it should work, but currently AFAIK @inbounds
still makes a difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before merging we'll have to clean the Travis config and bump the DataFrames dependency to 0.11.
575b9f9
to
1e06c2c
Compare
@nalimilan Thanks for the review. I've just squashed the fixups into appropriate commits, but the cumulative changeset is identical to what you have just reviewed. So I'll wait until DataFrames 0.11. |
c313f78
to
eaf82ff
Compare
I'm waiting for JuliaLang/METADATA.jl#12198 to see if this PR works on nightlies. |
- switch to Missings.jl/CategoricalArrays.jl - add jlvec() methods handling conversion logic - fix conversion of RLogicalVector into Vector{Bool} - remove DataArrays.jl dependency
(::Type{XXX})(...) -> XXX(...)
reinterpret() returns ReinterpretArray in 0.7 instead of Vector, whereas RVector{} only accepts Vector
allows to fix the result of readcomplex() from ReinterpretArray to Vector on 0.7
I suppose now it's the time for #19:
DataFrames.jl
v0.10v0.11 and switch toNulls.jl
Missings.jl
andCategoricalArrays.jl
fromDataArrays.jl
(NAs are representing asUnion{T, Null}
).