Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for R Dates and POSIXct but no support for handling timezone support #35

Merged
merged 28 commits into from
Mar 27, 2018

Conversation

jsams
Copy link
Contributor

@jsams jsams commented Nov 24, 2017

Branched off my rds branch, which maybe isn't ideal for you, but hopefully it works out.

It seems like supporting timezones is not possible because R uses an ambiguous string format. It doesn't seem as if RCall is doing anything about timezones either. I am a little concerned that RCall's magic constant for converting dates is different from mine, but you can see where mine comes from and is producing correct results.

except for the timezone issue, should close #34

@nalimilan
Copy link
Member

It seems like supporting timezones is not possible because R uses an ambiguous string format.

According to this StackOverflow answer, the three-letter abbreviations are only used for printing, but POSIXct objects contain the full name in the tzone attribute.

src/convert.jl Outdated
if hasnames(rv)
if class(rv) == ["Date"]
return date2julia(rv, hasna, nas)
elseif class(rv) == ["POSIXct"; "POSIXt"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

; is a concatenation operator, should be just ,.
Also, since you are reusing ["Date"] and ["POSIXct", ...] constants here and in date[time]2julia(), it makes sense to declare them as constants (const Date_class, POSIXct_class)

@alyst
Copy link
Collaborator

alyst commented Nov 24, 2017

Thanks for your work!

Branched off my rds branch, which maybe isn't ideal for you, but hopefully it works out.

You're right, it's not ideal :) For 2 reasons: 1) it's harder to review, 2) I cannot merge it since it clashes with #33.
You can just branch-off the master and cherry-pick the last two commits from this branch.
Of course, the tests use RDS files, which are not supported by master yet, but that should be easy to fix.
Or we wait until #33 lands.

I am a little concerned that RCall's magic constant

RCall has some tests for datetime.
Have you checked that these dates/datetimes are correctly handled by your PR (and vice versa)?

@jsams
Copy link
Contributor Author

jsams commented Nov 25, 2017

On timezones, it's not clear to me what the behavior should be if I try to do that. There's three scenarios:

  1. No timezone specified
  2. A proper (TimeZones.jl interpretable) timezone is specified
  3. An improper (TimeZones.jl uninterpretable) timezone is specified

(Despite the stack overflow question @nalimilan referred to, all the data I have on me has the three letter abbreviations in tzone, though it does also generate more reliable strings when given enough information at constructor time.)

Because they are so much easier to work with, I think we should prefer DateTimes over ZonedDateTimes where possible. So, in my mind, the correct behavior would be

  1. Read as if the data is stored as GMT, convert to local time, return DateTime
  2. Return as ZonedDateTime in the specified timezone
  3. Warn with unreadable tzone string, read as if in local time, return DateTime

Given my experience, I think 1. is the correct behavior as it would allow correct round-robin between R and Julia if Julia were to also write out in GMT but without specifying a timezone. Anything else I think would produce issues.

I think 3. is about the best we can do, and the lack of a ZonedDateTime let's the user know we aren't taking a strong stance on the accuracy of the time zone.

But I also see that 2 could just as easily be:
a. Read as specified timezone, convert to local time, return DateTime
b. Read as specified timezone, return as DateTime

Unfortunately, trying to test anything that plays with localtime will be a pain to test as the results will depend on what timezone the machine doing the test is in. The test would basically just duplicate the timezone converting code in the function, which kind of defeats the point of testing.

Alternatively, we can just return ZonedDateTime in all instances, and if the user wants it in localtime as a DateTime, they can convert post-import. Here, we would still treat scenario 1 as GMT and scenario 3 as localtime with a warning. This is much easier to implement/test, and I think would make the code much less complex.

@alyst
Copy link
Collaborator

alyst commented Nov 25, 2017

In general we should try to keep the imported data intact. 1. (GMT assumption and automatic conversion to local time) doesn't quite fit to that philosophy.
Ideally, it would be nice to return ZonedDataTime in the same time zone as specified in R, because someone might actually need timezones as they are stored in R. So it's unfortunate that we cannot rely on "tzone" attr for conversion into ZonedDateTime.

What about the following plan:

  1. we add timezone=:ignore/:keep keyword argument to load()/sexp2julia(), :ignore being the default.
  2. If timezone=:ignore we simply ignore tzone attr and return DateTime
  3. we introduce RDateTime (or POSIXct) struct that has datetime::DateTime and tzone::String fields. If timezone=:keep, sexp2julia() would convert R POSIXct into RDateTime preserving tzone attribute. For the time being convert(Type{DateTime}, d::RDateTime) would just return datetime field ignoring the timezone. We can also add convert(Type{ZonedDateTime}, d::RDateTime) along the logic you proposed -- but this could be done in another PR.

The default timezone=:ignore should be absolutely fine for ~90% of the usecases as they do not need timezone data.
The remaining 10% would use timezone=:keep and decide how to handle timezones on their own.

@alyst
Copy link
Collaborator

alyst commented Nov 25, 2017

@jsams It looks like you will have to rebase this and RDS PRs when #28 would be merged (very soon). Unfortunately, it refactors the conversion code, so there will be conflicts during the rebase. However, the changes should be pretty straightforward.

@nalimilan
Copy link
Member

I'm very reluctant to discard any timezone information silently, as it can give completely incorrect results. I think a good rule would be to adopt exactly the same behavior as R. That is, we should interpret time zone codes just as it would on conversion, so that further operations in Julia will give the same behavior as in R.

According to ?timezone ("Time zone names" section), three-letter codes are matched to those used in the Olson database. So the codes have a well-defined meaning that we can use. TimeZones.jl has chosen not to allow them because they are ambiguous for humans. But since TimeZones.jl parses the data, it should be possible to use it to parse the R POSIXct three-letter codes and replace them with the full time zone coes. This feature should probably be added to that package, since it can be useful in general for importing data, e.g. in the CSV format.

@jsams
Copy link
Contributor Author

jsams commented Nov 27, 2017

I've pushed a commit that returns ZonedDateTimes in all instances and makes a best effort attempt to use the available timezone information and falls back to UTC otherwise. As TimeZones.jl improves, the guesses of the TimeZone should as well.

I think kludging together our own DateTime type is less useful than warning and providing a type that already has a reasonably rich set of functions for converting from different time zones. Also, I don't have time to do all of that. I need to get back to working on my dissertation full time.

I've tried to craft these PR's to be useful for you and your package and to get them to your standards. But I don't have time to chase after moving goal posts. Thanks for the guidance and writing the package in the first place!

@nalimilan
Copy link
Member

My goal wasn't to discourage you. It's fine to start with a limited support for time zones, as long as the warning makes it clear to users that something may not work as they expect. We can always make the system more complex later when somebody needs it.

@jsams
Copy link
Contributor Author

jsams commented Mar 14, 2018

I'll see if I can update this to the latest master as well later this evening. Thanks for pulling in the rds stuff!

Copy link
Collaborator

@alyst alyst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for resuming this PR!
Overall, I looks fine, except few rearrangements (jlvec()) and avoiding DataArrays.

src/convert.jl Outdated
@@ -1,6 +1,9 @@
# converters from selected RSEXPREC to Hash
# They are used to translate SEXPREC attributes into Hash

import TimeZones: unix2zdt
import DataArrays: @data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need DataArray. Currently RData doesn't require DataArrays.
Vector{Union{Data, Missing}} would do just fine.

src/convert.jl Outdated
@@ -98,8 +101,11 @@ function sexp2julia(rv::RVEC)
# TODO dimnames?
# FIXME add force_missing option to control whether always convert to Union{T, Missing}
jv = jlvec(rv, false)
if hasnames(rv)
# if data has no NA, convert to simple Vector
if class(rv) == R_Date_Class
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current design jlvec() is supposed to do the job of converting R vectors to Julia ones.
So class(rv) checking should go to jlvec(rv::RVEC), data2julia() should become jlvec(Dates.Date, rv::RNullableVector{R}, force_missing::Bool=true) and datetime2julia() -> jlvec(DateTime, ...).
The benefit is that you don't have to replicate the support of vector names or scalar conversion (which should only be done for top-level objects or for list elements, but not for dataframes).

src/convert.jl Outdated
@@ -128,3 +134,53 @@ function sexp2julia(rl::RList)
map(sexp2julia, rl.data)
end
end

function date2julia(rv)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jlvec(Date, ...) as noted above

src/convert.jl Outdated
ZonedDateTime(Dates.unix2datetime(seconds), tz, from_utc=true)
end

function datetime2julia(rv)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

jlvec(DateTime, ...) as noted above

src/convert.jl Outdated
else
dates = Dates.epochdays2date.(rv.data .+ epoch_conv)
end
if hasnames(rv)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just return dates. Any further processing should be handled by sexp2julia()

end

function unix2zdt(seconds::Real; tz::TimeZone=tz"UTC")
ZonedDateTime(Dates.unix2datetime(seconds), tz, from_utc=true)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if tz != tz"UTC", is from_utc still correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The tests ensure that the time and timezone is preserved for a non-UTC timezone.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method definition is a duplicate of that imported from TimeZones

test/RDS.jl Outdated
@@ -1,7 +1,9 @@
module TestRDS
using Base.Test
using DataArrays
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for DataArrays.

test/RDS.jl Outdated
@@ -42,5 +44,59 @@ module TestRDS
@test eltypes(rdf_decomp) == eltypes(df)
@test isequal(rdf_decomp, df)
end

@testset "Test Date conversion" begin
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's ok to have extensive date/datetime tests in RDS testset, but we also need to check that conversion of Date/DateTime columns is also supported (I guess with the current code it would fail for 1-row dataframes). Could you please add these to RDA.jl?

NEWS.md Outdated
@@ -2,9 +2,11 @@

##### Changes
* add support for `.rds` files (single object data files from R) [#22], [#33]
* add support for `Date` and `POSIXct`, though still lacking complete timezone handling [#34]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it would fail if R has some exotic timezone?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it fails for e.g. 3-letter timezones, as I documented before. And generally, I'm outsourcing all timezone handling to TimeZones.jl. Anything they don't handle, I don't. Further, anything can be put in that attribute (which is after all just a string); so, it's entirely possible that R allows things that would be difficult to know what to do with.

In short, I'm not fully replicating R's behavior.

test/RDS.jl Outdated
@test datetimes[1] == ZonedDateTime(DateTime("2017-01-01T21:23"), tz"UTC")
# tz"CST" is invalid, but if TimeZones ever enables support for these 3
# letter codes, a test would be useful. For now, intentionally not testing
#@test_broken datetimes[2] == ZonedDateTime(DateTime("2017-01-01T13:23"), tz"CST")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to test that unsupported timezones behave as we expect

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is why I have the @test_warn when loading the file. It warns that things may not be working as expected. That said, I couldn't get the @test_broken to actually work the way I thought it was supposed to. If you have better advice here, I'll try it out. Will comment again when I have made your suggested changes.

jsams added 2 commits March 15, 2018 19:27
    * refactor to use jlvec
    * replace try block with key lookup
    * remove DataArray dependency
    * factor date conversion to rdays2date
    * add test for date and datetimes in data frames, including single row
@coveralls
Copy link

coveralls commented Mar 16, 2018

Coverage Status

Coverage increased (+5.5%) to 73.188% when pulling e86ee6d on jsams:dates into 04bf272 on JuliaData:master.

@jsams
Copy link
Contributor Author

jsams commented Mar 16, 2018

Ok, made the requested changes and added dataframe checks for both single and multi-row dataframes.

addendum: Made another little refactoring that just occurred to me makes sense. It's not clear what to do when R does not specify any timezone. Right now, I'm assuming UTC and not warning. It might be reasonable to want to warn about this? I'll leave that decision to you as to how noisy and pedantic you want your library to be.

@alyst
Copy link
Collaborator

alyst commented Mar 27, 2018

@jsams Thanks for the updates! I've made some small changes, I hope you are fine with them.

@jsams
Copy link
Contributor Author

jsams commented Mar 27, 2018

Yea, I don't have any problems with those changes. Thanks!

@alyst alyst merged commit 2de528a into JuliaData:master Mar 27, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support POSIXct and Date objects
5 participants