POSIXct -> nanotime -> POSIXct consistency #115

Closed
DavorJ opened this issue Jun 3, 2023 · 13 comments

Comments

@DavorJ

DavorJ commented Jun 3, 2023

Take this timestamp:

ts <- as.POSIXct(as.numeric("0x1.91ef2f6e3abb9p+30"), origin = '1970-01-01')

> print(as.numeric(ts), digits = 22)
[1] 1685834715.5573561

> nanotime::as.nanotime(ts)
[1] 2023-06-03 23:25:15.557356000

> print(as.numeric(as.POSIXct(nanotime::as.nanotime(ts))), digits = 22)
[1] 1685834715.5573559

Hence the original POSIXct is not preserved.

Nanotime has extra 3 digits of precision, but it seems that POSIXct double precision is cut off at µs and the 3 extra digits are not utilized (i.e. they are hardcoded to 0). If they were used to describe the double, then this situation would not be possible, and POSIXct -> nanotime -> POSIXct consistency would be guaranteed, no?
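For illustration (this is not package code, just a quick check of the claim): the double's fractional seconds carry more than six meaningful digits, and a plain microsecond cutoff discards them.

ts <- as.POSIXct(as.numeric('0x1.91ef2f6e3abb9p+30'), origin = '1970-01-01')
frac <- as.numeric(ts) %% 1                 # fractional seconds stored in the double
sprintf("%.9f", frac)                       # sub-microsecond digits are present here
sprintf("%.9f", trunc(frac * 1e6) / 1e6)    # a microsecond cutoff zeroes them out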

One could also argue that the conversion, as it stands now, is losing information from the POSIXct timestamp.

I am using Rcpp_1.0.7, RcppCCTZ_0.2.10, zoo_1.8-12 and nanotime_0.3.7 on R v3.6.3.

@eddelbuettel
Owner

eddelbuettel commented Jun 4, 2023

Can we back off here a little? What is "0x1.91ef2f6e3abb9p+30" supposed to be, and how do you reason that it (whatever it is) fits into a POSIXct, which we know is a (64 bit) double (with a 53 bit mantissa)?

Nobody claims nanotime can "losslessly" convert from "any possible POSIXct representation". All we claim, just like Python and Julia and others do, is that time can be represented as nanosecond increments since the epoch. No more, no less.

@DavorJ
Author

DavorJ commented Jun 4, 2023

Hi Eddel. 0x1.91ef2f6e3abb9p+30 is just the sprintf('%a', number) representation of a double. The conversion is fine for most doubles, but not for all; hence, to make the example reproducible, I need a very specific double.
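As an aside (illustration only), the "%a" form is an exact, reproducible encoding of a double:

x <- as.numeric(Sys.time())
hex <- sprintf("%a", x)        # hexadecimal (exact) representation of the double
identical(as.numeric(hex), x)  # TRUE: parsing the hex string recovers the same bits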

I agree: information will always be lost in conversion, but you could make it less lossy by using the digits beyond µs precision. For example:

> nanotime::as.nanotime(bit64::as.integer64(as.numeric(ts)*1E9))
[1] 2023-06-03 23:25:15.557356032

> as.POSIXct(nanotime::as.nanotime(bit64::as.integer64(as.numeric(ts)*1E9))) == ts
[1] TRUE

Here I generate some random timestamps in tmp and then check in how many cases the conversion is consistent:

> tmp <- as.POSIXct(replicate(100000, utils$time$now()), origin = '1970-01-01')
> length(unique(tmp))
[1] 100000
> mean(tmp == as.POSIXct(nanotime::as.nanotime(bit64::as.integer64(as.numeric(tmp)*1E9))))
[1] 0.9317
> mean(tmp == as.POSIXct(nanotime::as.nanotime(tmp)))
[1] 0.23126

As it stands now, 23 % of timestamps were transformed consistently, while with the *1E9 approach it becomes 93 %. My question is: can we make it 100 %? Using *1E9 introduces rounding errors of its own. A better way would be to take just the decimal part of the double, round it to ns precision, and then convert to nanotime.

PS utils$time$now() is just an internal function based on microbenchmark::get_nanotime() that I use to generate more precise now-timestamps, since Sys.time() is not precise enough (i.e. it generates lots of duplicates).
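One possible sketch of such a helper (the name make_now() and the anchoring approach are illustrative, not the actual internal code): anchor the monotonic get_nanotime() counter to a single wall-clock reading and derive "now" from the nanosecond counter.

make_now <- function() {
  t0 <- as.numeric(Sys.time())           # wall-clock anchor, seconds since the epoch
  n0 <- microbenchmark::get_nanotime()   # monotonic anchor, nanoseconds
  function() as.POSIXct(t0 + (microbenchmark::get_nanotime() - n0) / 1e9,
                        origin = '1970-01-01')
}
now <- make_now()
length(unique(replicate(1e3, as.numeric(now()))))  # few or no duplicates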

@DavorJ
Author

DavorJ commented Jun 4, 2023

Here is a good explanation of the issue, and a way to solve it.

I didn't find this modf exposed by any package, so here is something very quick & dirty:

Rcpp::cppFunction('
  List modf(double x) {
    // split x into its integral and fractional parts (both carry the sign of x)
    double ipart;
    double fraction = modf(x, &ipart);  // resolves to the C math library modf()

    List result = List::create(_["int"] = ipart, _["fraction"] = fraction);
    return result;
  }
')

splt <- data.table::data.table(ts = tmp)
splt <- cbind(splt, data.table::rbindlist(lapply(as.numeric(splt$ts), modf)))
splt[, int64 := bit64::as.integer64(int)*1E9 + round(fraction*1E9)]
splt[, ntime := nanotime::as.nanotime(int64)]
splt[, consistent := ts == as.POSIXct(ntime)]
mean(splt$consistent) # outputs 1

Hence 100 % consistency, at least on the 100k samples above. I wonder whether something like this would work in all cases? I don't understand all the details/peculiarities yet though, so take it with a big grain of salt ;).

@DavorJ
Author

DavorJ commented Jun 4, 2023

A bit of further testing:

                            ts      int     fraction             int64                         ntime consistent
 1: 1970-01-02 13:12:18.245308   133938  0.245308399   133938245308399 1970-01-02 13:12:18.245308399      FALSE
 2: 1970-02-11 13:07:27.118945  3589647  0.118945599  3589647118945599 1970-02-11 13:07:27.118945599      FALSE
 3: 1969-12-05 00:53:03.532405 -2329617 -0.467594147 -2329617467594147 1969-12-05 00:53:02.532405853      FALSE

For timestamps close to the epoch, you can't guarantee the consistency since the fractional part of POSIXct is much more precise than nanotime in that region. But from a few months onward from the epoch, when the integer part gets relatively larger, I believe you can.
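A rough way to see this (ulp() here is just an ad-hoc helper, not from any package): the spacing between adjacent doubles grows with the magnitude of the value, so near the epoch it is finer than a nanosecond, while today it is coarser than a nanosecond.

ulp <- function(x) 2^(floor(log2(abs(x))) - 52)  # spacing of adjacent doubles near x
ulp(133938)        # ~3e-11 s near the epoch: finer than a nanosecond
ulp(1685834715)    # ~2.4e-07 s in 2023: coarser than a nanosecond, finer than a microsecond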

@DavorJ
Author

DavorJ commented Jun 4, 2023

And here is a proposal that I tested on many current (1M) and random (between 1900-2100) timestamps:

#' POSIXct to nanotime
#' Enforces POSIXct -> nanotime -> POSIXct consistency for timestamps
#' more than approximately +/- 3 months away from the Unix epoch.
#' See: https://github.com/eddelbuettel/nanotime/issues/115
as.nanotime.POSIXct <- function(x) {
  stopifnot(inherits(x, 'POSIXct'))
  xx <- as.numeric(x)
  frac <- sign(xx) * (abs(xx) %% 1) # equivalent to C++ modf
  nanotime::as.nanotime(bit64::as.integer64(xx)*1E9L + round(frac*1E9))
}
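Usage sketch, applying the proposal to the timestamp from the opening comment (the round trip should hold since it is far from the epoch):

ts <- as.POSIXct(as.numeric('0x1.91ef2f6e3abb9p+30'), origin = '1970-01-01')
as.POSIXct(as.nanotime.POSIXct(ts)) == ts  # TRUE, unlike the current conversion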

@eddelbuettel
Owner

For timestamps close to the epoch, you can't guarantee the consistency since the fractional part of POSIXct is much more precise than nanotime in that region. But from a few months onward from the epoch, when the integer part gets relatively larger, I believe you can.

That one I was aware of -- the basic limitation of a double as a timestamp is the "uneven distribution" of precision over the range of values, which is related to the fact that we can only have 'a little less than a microsecond' of precision for dates right now.

I had not, however, carried that 'forward' to 'much less precision for future dates' and 'backward' to 'more (but "useless") precision for past dates'.

And I like your proposal and the use of modf() (and it is not often that I come across an old C math library function that I had never (?) heard of, as the modulo operator dominated my use). I am sure @lsilvest and I will take a look!

@lsilvest
Collaborator

lsilvest commented Jun 5, 2023

So the original implementation is:

setMethod("nanotime",
          "POSIXct",
          function(x) {
              ## force last three digits to be zero
              n <- names(x)
              res <- new("nanotime", as.integer64(as.numeric(x) * 1e6) * 1000)
              if (!is.null(n)) {
                  names(S3Part(res, strictS3=TRUE)) <- n
              }
              res
          })

The idea was to limit the precision to roughly the significant digits a double has around the current epoch, which in retrospect is not of much use and is probably annoying.
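For reference (just re-running the arithmetic above on the example from the opening comment, not new package code), this is where the trailing zeros come from:

ts <- as.POSIXct(as.numeric("0x1.91ef2f6e3abb9p+30"), origin = "1970-01-01")
nanotime::as.nanotime(bit64::as.integer64(as.numeric(ts) * 1e6) * 1000)
# ...15.557356000 -- the last three (nanosecond) digits are forced to zero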

I do like the idea of a round trip that is mostly lossless, but looking at performance it does come at quite a cost:

library(nanotime)
library(rbenchmark)


## variant 1: round-trip preserving, sign/abs split of integer and fractional seconds
to_nanotime_1 <- function(x) {
    n <- names(x)
    xx <- as.numeric(x)
    frac <- sign(xx) * (abs(xx) %% 1) # equivalent to C++ modf
    res <- new("nanotime", bit64::as.integer64(xx)*1E9L + round(frac*1E9))
    if (!is.null(n)) {
        names(S3Part(res, strictS3=TRUE)) <- n
    }
    res
}

## variant 2: round-trip preserving, trunc()-based split
to_nanotime_2 <- function(x) {
    n <- names(x)
    xx <- as.numeric(x)
    frac <- xx - trunc(xx)
    res <- new("nanotime", bit64::as.integer64(xx)*1E9 + round(frac*1E9))
    if (!is.null(n)) {
        names(S3Part(res, strictS3=TRUE)) <- n
    }
    res
}

## variant 3: direct multiplication, fast but lossy once the product exceeds 2^53
to_nanotime_3 <- function(x) {
    n <- names(x)
    res <- new("nanotime", bit64::as.integer64(as.numeric(x) * 1e9))
    if (!is.null(n)) {
        names(S3Part(res, strictS3=TRUE)) <- n
    }
    res
}


d <- 1:1e6 + runif(1e6)
pct <- as.POSIXct(d, origin="1970-01-01")

benchmark(
    "n1" = {
        to_nanotime_1(pct)
    },
    "n2" = {
        to_nanotime_2(pct)
    },
    "n3" = {
        to_nanotime_3(pct)
    },
    replications = 100,
    columns = c("test", "replications", "elapsed",
                "relative", "user.self", "sys.self"))

Which on my computer gives:

  test replications elapsed relative user.self sys.self
1   n1          100   3.258    6.845     3.259        0
2   n2          100   2.900    6.092     2.900        0
3   n3          100   0.476    1.000     0.475        0

So maybe I'd be in favor of changing the current behavior so that it no longer truncates to microseconds, but without trying to preserve the round trip either. Except for a region around the epoch, these floating point numbers are equivalent in terms of significant digits. What are your thoughts?

@eddelbuettel
Owner

Maybe a hybrid method, by adding a new argument that allows switching between 'fast(er)' and '(more) accurate'?

@DavorJ
Author

DavorJ commented Jun 5, 2023

@lsilvest, here are a few regression tests that may be useful:

local({
  # This one should break if cutting off on µs
  # See: https://github.com/eddelbuettel/nanotime/issues/115
  ts <- as.POSIXct(as.numeric('0x1.91ef2f6e3abb9p+30'), origin = '1970-01-01')
  stopifnot(ts == as.POSIXct(as.nanotime.POSIXct(ts)))

  # This one should break if multiplying the double by 1E9 directly
  # without considering the integer and fraction parts separately.
  ts <- as.POSIXct(as.numeric('0x1.91f18e4d0065p+30'), origin = '1970-01-01')
  stopifnot(ts == as.POSIXct(as.nanotime.POSIXct(ts)))

  # This one is negative, making sure that negative doubles are also consistent.
  ts <- as.POSIXct(as.numeric('-0x1.c6e8c4d077ae4p+30'), origin = '1970-01-01')
  stopifnot(ts == as.POSIXct(as.nanotime.POSIXct(ts)))
})

The as.nanotime.POSIXct() should point to your function.

If my preference matters: I would also allow a switch.

PS The bit64 implementation is... well... not optimized? On scalars it is also incredibly slow:

a <- bit64::as.integer64(2)
b <- bit64::as.integer64(2)
microbenchmark::microbenchmark(a*b, 2*2)
# Unit: nanoseconds
#   expr   min    lq  mean median    uq    max neval
#  a * b 56300 58200 63766  59200 62050 168700   100
#  2 * 2     0   100   250    200   300   1300   100

@lsilvest
Collaborator

lsilvest commented Jun 7, 2023

I think it's optimized, but you are measuring the overhead of the S3 dispatch. If you choose a long vector I think you will see bit64 is efficient.
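For example (a quick sketch, not a rigorous benchmark), the picture changes on a long vector where the per-call dispatch overhead is amortized:

a <- bit64::as.integer64(1:1e6)
b <- bit64::as.integer64(1:1e6)
x <- as.numeric(1:1e6)
microbenchmark::microbenchmark(a * b, x * x, times = 100)  # same order of magnitude, unlike the scalar case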

And thanks for the regression tests!

Back to some considerations on the problem, and I'll try to characterize exactly what we would be doing if we didn't do the rounding proposed here.

If I take the second number provided by @DavorJ and also look at the shortest decimal representations of the floating-point value immediately before and the one immediately after, we get:

0x1.91f18e4d0064fp+30 -> 1685873555.250385
0x1.91f18e4d0065p+30  -> 1685873555.2503853
0x1.91f18e4d00651p+30 -> 1685873555.2503855

So what we would be doing with the uncorrected version is effectively mapping 0x1.91f18e4d0065p+30 to 0x1.91f18e4d00651p+30.

And if we want to think about the precision of the fractional part of POSIXct the other way round:

> sprintf("%a", 1685873555.2503851)
[1] "0x1.91f18e4d0064fp+30"
> sprintf("%a", 1685873555.2503852)
[1] "0x1.91f18e4d0065p+30"
> sprintf("%a", 1685873555.2503853)
[1] "0x1.91f18e4d0065p+30"
> sprintf("%a", 1685873555.2503854)
[1] "0x1.91f18e4d0065p+30"
> sprintf("%a", 1685873555.2503855)
[1] "0x1.91f18e4d00651p+30"
> sprintf("%a", 1685873555.2503856)
[1] "0x1.91f18e4d00651p+30"

Still unsure what the best option is here, but the above shows that the rounding errors happen in the zone where POSIXct cannot in any case ensure the exactness of its decimal representation. So I think it's not the precision we are bothered by here, but the surprise to a user who will not get a guaranteed round trip.

@eddelbuettel
Owner

I think it's optimized, but you are measuring the overhead of the S3 dispatch.

Had the exact same thought when reading: the benchmark is measuring the forced type conversion, which is the cost and "tax" bit64 pays to give us 64-bit integer semantics.

I also like the follow-up. It is known that POSIXct has design deficiencies and limits; we cannot change that. Maybe the best we can do is to consider a second (optional) conversion as suggested.

@lsilvest
Collaborator

lsilvest commented Jun 8, 2023

Yes, that's most likely the best option. I'm slightly leaning towards having the exact conversion as default, maybe with the additional parameter accurate=TRUE? Unless someone has a better name for this parameter that also conveys the notion of speed? And maybe you prefer the default to be accurate=FALSE? I don't have a strong opinion here. Once we've decided on that I can make the changes and will add the units tests provided by @DavorJ.

@eddelbuettel
Owner

I actually like that -- both the new behavior as the default choice, and the new opt-in switch, for which accurate=TRUE is a good name.

eddelbuettel added a commit that referenced this issue Jul 8, 2023
add parameter 'accurate' for conversions from 'POSIXct'; closes #115