ARROW-17637: [R] as.Date fails going from timestamp[us] to timestamp[s] #14935

Merged Dec 14, 2022 (2 commits)

paleolimbot

Before this PR:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
#> Loading required package: timechange

# Use as_datetime() because as.POSIXct() truncates the fractional seconds
ds <- InMemoryDataset$create(data.frame(x = as_datetime('2022-05-05T00:00:01.676632')))
ds %>%
  mutate(date = as.Date(x)) %>%
  collect()
#> Error in `compute.arrow_dplyr_query()` at r/R/dplyr-collect.R:22:2:
#> ! Invalid: Casting from timestamp[us, tz=UTC] to timestamp[s, tz=UTC] would lose data: 1651708801676632
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec.cc:821  kernel_->exec(kernel_ctx_, input, out)
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec.cc:789  ExecuteSingleSpan(input, &output)
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:608  executor->Execute( ExecBatch(std::move(arguments), all_scalar ? 1 : input.length), &listener)
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/expression.cc:590  ExecuteScalarExpression(call->arguments[i], input, exec_context)
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/compute/exec/project_node.cc:91  ExecuteScalarExpression(simplified_expr, target, plan()->exec_context())
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:334  ReadNext(&batch)
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/record_batch.cc:348  ToRecordBatches()

#> Backtrace:
#>      ▆
#>   1. ├─ds %>% mutate(date = as.Date(x)) %>% collect()
#>   2. ├─dplyr::collect(.)
#>   3. └─arrow:::collect.arrow_dplyr_query(.)
#>   4.   └─arrow:::compute.arrow_dplyr_query(x) at r/R/dplyr-collect.R:22:2
#>   5.     └─base::tryCatch(...) at r/R/dplyr-collect.R:40:2
#>   6.       └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   7.         └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   8.           └─value[[3L]](cond)
#>   9.             └─arrow:::augment_io_error_msg(e, call, schema = schema()) at r/R/dplyr-collect.R:49:6
#>  10.               └─rlang::abort(msg, call = call) at r/R/util.R:251:2

Created on 2022-12-13 with reprex v2.0.2

After this PR:

library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
library(dplyr, warn.conflicts = FALSE)
library(lubridate, warn.conflicts = FALSE)
#> Loading required package: timechange

# Use as_datetime() because as.POSIXct() truncates the fractional seconds
ds <- InMemoryDataset$create(data.frame(x = as_datetime('2022-05-05T00:00:01.676632')))
ds %>%
  mutate(date = as.Date(x)) %>%
  collect()
#>                     x       date
#> 1 2022-05-05 00:00:01 2022-05-05

Created on 2022-12-13 with reprex v2.0.2
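
The error above reports the raw value 1651708801676632, which is the timestamp in microseconds since the Unix epoch. As a rough illustration of why a checked cast to seconds refuses it, the unit arithmetic can be sketched in base R. This is only a sketch of the arithmetic, not Arrow's cast kernel, and the variable names are made up for the example.

```r
# Raw value from the error message: microseconds since the Unix epoch.
us <- 1651708801676632

# Split into whole seconds and the leftover microseconds.
whole_seconds <- us %/% 1e6  # 1651708801
leftover_us <- us %% 1e6     # 676632, the part a cast to seconds cannot keep

# The calendar date depends only on the whole seconds.
days <- whole_seconds %/% 86400
date <- as.Date(days, origin = "1970-01-01")
```

The non-zero remainder is what a cast to `timestamp[s]` would have to discard, whereas the date is fully determined by the whole seconds.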

@github-actions
⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

@thisisnic thisisnic left a comment

Thanks!

@paleolimbot paleolimbot merged commit 702dbf3 into apache:master Dec 14, 2022
@paleolimbot paleolimbot deleted the r-as-date-datetime branch December 14, 2022 16:10
ursabot commented Dec 15, 2022

Benchmark runs are scheduled for baseline = 4751c89 and contender = 702dbf3. 702dbf3 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.26% ⬆️0.07%] test-mac-arm
[Finished ⬇️0.54% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.41% ⬆️0.03%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 702dbf39 ec2-t3-xlarge-us-east-2
[Finished] 702dbf39 test-mac-arm
[Finished] 702dbf39 ursa-i9-9960x
[Finished] 702dbf39 ursa-thinkcentre-m75q
[Finished] 4751c89b ec2-t3-xlarge-us-east-2
[Finished] 4751c89b test-mac-arm
[Finished] 4751c89b ursa-i9-9960x
[Finished] 4751c89b ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

ursabot commented Dec 15, 2022

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

thisisnic added a commit that referenced this pull request Jan 11, 2023
…able to handle time in sub seconds (#13890)

This change allows strings containing sub-second values, as well as doubles, to be used as input to `lubridate::as_datetime()`.

```r
1.5 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    y = lubridate::as_datetime(x)
  ) |>
  dplyr::collect() |>
  dplyr::mutate(
    z = lubridate::as_datetime(x),
    is_equal = (y == z)
  )
#>     x                   y                   z is_equal
#> 1 1.5 1970-01-01 00:00:01 1970-01-01 00:00:01     TRUE
```

Because the timestamp type generated by `as_datetime` is expected to be used in combination with other functions, this change also fixes a bug in ~~`as.Date` and~~ `lubridate::as_date` that could cause an error when the input was a timestamp with sub-second precision.

Edit: as.Date fixed by #14935

As a breaking change, `as_datetime()` now returns a timestamp with nanosecond precision. I hope this will not have a major impact, because `as_datetime() |> as.integer()` and `as_datetime() |> as.numeric()` could not be used before either: they would try to cast to int32 or double, resulting in an error.
(The timestamp has to be cast to int64 instead.)

arrow 9.0.0

```r
1 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    x = lubridate::as_datetime(x),
    y = cast(x, arrow::int64())
  ) |>
  dplyr::collect()
#>                     x y
#> 1 1970-01-01 00:00:01 1
```

This PR

``` r
1 |>
  arrow::arrow_table(x = _) |>
  dplyr::mutate(
    x = lubridate::as_datetime(x),
    y = cast(x, arrow::int64())
  ) |>
  dplyr::collect()
#>                     x          y
#> 1 1970-01-01 00:00:01 1000000000
```
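
To make the unit change concrete: the integer underlying a timestamp counts ticks of its unit since the epoch, so the same instant appears as 1 with seconds and as 1000000000 with nanoseconds. A minimal base-R sketch of converting between the two, assuming a hypothetical helper `ns_to_seconds()` that is not part of arrow or lubridate:

```r
# Hypothetical helper (not an arrow or lubridate function): convert a count of
# nanoseconds since the epoch into seconds.
ns_to_seconds <- function(ns) {
  ns / 1e9
}

# The int64 value shown under "This PR" corresponds to one second.
one_second <- ns_to_seconds(1000000000)

# The 1.5 second input from the earlier example, expressed in nanoseconds.
ns_for_1_5 <- 1.5 * 1e9
```

Code that previously read the int64 value as seconds would need a conversion along these lines after this change.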

Lead-authored-by: SHIMA Tatsuya <ts1s1andn@gmail.com>
Co-authored-by: Nic Crane <thisisnic@gmail.com>
Signed-off-by: Nic Crane <thisisnic@gmail.com>