Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16510: [R] Add bindings for GCS filesystem #13404

Merged
merged 43 commits into from
Jun 26, 2022

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Jun 20, 2022

This adds basic bindings for GcsFileSystem to R, turns it on in the macOS, Windows, and Linux packaging (same handling as ARROW_S3), and basic R tests.

Followups:

  • Bindings for FromImpersonatedServiceAccount (ARROW-16885)
  • Set up testbench for fuller tests, like how we do with minio (ARROW-16879)
  • GcsFileSystem::Make should return Result (ARROW-16884)
  • Explore auth integration/compatibility with gargle, googleAuthR, etc.: can we pick up the same credentials they use (ARROW-16880)
  • macOS binary packaging: push dependencies upstream (ARROW-16883)
  • Windows binary packaging: push dependencies upstream (ARROW-16878)
  • Update cloud/filesystem documentation (ARROW-16887)

@github-actions
Copy link

strings
strings_internal
symbolize
synchronization
throw_delegate
time
time_zone)
time_zone
wyhash)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the cmake changes here seem to just be differences in sorting (due to my locale?), but the omission of wyhash seems to be legitimately new.

# sed -e 's;.*/absl_;set_property(TARGET absl::;' \
# -e 's/.pc:Requires:/ PROPERTY INTERFACE_LINK_LIBRARIES /' \
# -e 's/ = 20210324,//g' \
# -e 's/ = 20210324//g' \
# -E -e 's/ = 20[0-9]{6},?//g' \
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I simplified this regex so that it is robust to future abseil version bumps (at least until 2030!)

@@ -3542,18 +3544,6 @@ macro(resolve_dependency_absl)
APPEND
PROPERTY INTERFACE_LINK_LIBRARIES ${CoreFoundation})
endif()
set_property(TARGET absl::type_traits PROPERTY INTERFACE_LINK_LIBRARIES absl::config)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are duplicated above so I removed them here; I suspect these were an artifact of a bad merge

@@ -44,7 +44,9 @@ get_exported_functions <- function(decorations, export_tag) {
out <- decorations %>%
filter(decoration %in% paste0(export_tag, "::export")) %>%
mutate(functions = map(context, decor:::parse_cpp_function)) %>%
{ vec_cbind(., vec_rbind(!!!pull(., functions))) } %>%
{
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and the rest of the changes in the file seem to be just styler running on the file (vscode did it on save). Since the code generated after these style changes looks identical (there is no massive diff in the arrowExports.* files), maybe we can remove the styler_exclusion on this file?

r/src/arrow_types.h Outdated Show resolved Hide resolved
r/src/filesystem.cpp Outdated Show resolved Hide resolved
r/src/filesystem.cpp Outdated Show resolved Hide resolved
@nealrichardson
Copy link
Member Author

I turned on GCS wherever S3 was turned on (so NOT_CRAN=true will build GCS on linux). We should check the build times and see if it's worth the cost or whether it should be a more explicit opt-in. I think we should have it on in the binaries we build though.

@nealrichardson
Copy link
Member Author

R debian build is failing with an undefined symbol from absl, a different one from what I saw locally (on macOS):

https://github.com/apache/arrow/runs/6989838112?check_suite_focus=true#step:5:2990

@nealrichardson nealrichardson marked this pull request as ready for review June 21, 2022 18:39
@nealrichardson
Copy link
Member Author

@github-actions crossbow submit homebrew-r-autobrew

@github-actions
Copy link

Revision: df5033b

Submitted crossbow builds: ursacomputing/crossbow @ actions-b4426052af

Task Status
homebrew-r-autobrew Github Actions

@nealrichardson
Copy link
Member Author

@coryan any ideas about that undefined abseil symbol mentioned above? undefined symbol: _ZN4absl12lts_2021110212RFC3339_fullE

Similar issue in the homebrew build Symbol not found: __ZN4absl12lts_2021110210FormatTimeENS0_11string_viewENS0_4TimeENS0_8TimeZoneE See #13407 (comment)

@coryan
Copy link
Contributor

coryan commented Jun 21, 2022

TL;DR; not much of an idea, sorry.

any ideas about that undefined abseil symbol mentioned above?

I assume "above" refers to https://github.com/apache/arrow/runs/6989838112?check_suite_focus=true#step:5:2990 just let me know if I read that wrong.

undefined symbol: _ZN4absl12lts_2021110212RFC3339_fullE

Seems like Abseil is not being linked? The log says:

-- All bundled static libraries: Snappy::snappy;thrift::thrift;jemalloc;mimalloc::mimalloc;lz4::lz4;zstd::libzstd;re2::re2;utf8proc::utf8proc;google-cloud-cpp::storage;google-cloud-cpp::common;aws-cpp-sdk-identity-management;aws-cpp-sdk-sts;aws-cpp-sdk-cognito-identity;aws-cpp-sdk-s3;aws-cpp-sdk-core;AWS::aws-c-event-stream;AWS::aws-checksums;AWS::aws-c-common

That seems to be missing Abseil, but I do not quite understand how the list of bundled dependencies is created.

Similar issue in the homebrew build Symbol not found: __ZN4absl12lts_2021110210FormatTimeENS0_11string_viewENS0_4TimeENS0_8TimeZoneE See #13407 (comment)

Probably a similar problem as above. Less likely, it could be Abseil's sensitivity to C++11 vs. C++17 building (see abseil/abseil-cpp#696), I always think it could be that issue if string_view, variant or any are involved.

@nealrichardson
Copy link
Member Author

@coryan thanks, that's helpful actually. I'll have to dig more on the bundled libraries thing. On the Homebrew build, abseil indeed is being built with C++17 by brew (https://github.com/Homebrew/homebrew-core/blob/master/Formula/abseil.rb#L31), and the arrow build is using C++11. What's odd is that we've been depending on it (transitively) for a while because Flight uses grpc (also built C++17 by brew), but I guess google-cloud-cpp hits different parts.

@nealrichardson
Copy link
Member Author

Well, that didn't work, there must be some other piece I'm missing:

-- Configuring done
CMake Error at cmake_modules/BuildUtils.cmake:112 (file):
  Error evaluating generator expression:
    $<TARGET_FILE:absl::memory>
  Target "absl::memory" is not an executable or library.
Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:608 (create_merged_static_lib)
CMake Error at cmake_modules/BuildUtils.cmake:112 (file):
  Error evaluating generator expression:
    $<TARGET_FILE:absl::memory>
  Target "absl::memory" is not an executable or library.
Call Stack (most recent call first):
  src/arrow/CMakeLists.txt:608 (create_merged_static_lib)

@coryan
Copy link
Contributor

coryan commented Jun 21, 2022

I suspect absl::memory is a header-only library, so it does not have a $<TARGET_FILE>.

@coryan
Copy link
Contributor

coryan commented Jun 22, 2022

If you have a VM with abseil installed, you could do something like:

pkg-config --libs absl_memory absl_strings absl_str_format absl_time absl_variant absl_base absl_memory absl_optional absl_span absl_time absl_variant
-L/usr/local/lib64 -labsl_str_format_internal -labsl_bad_optional_access -labsl_time -labsl_civil_time -labsl_strings -labsl_strings_internal -lrt -labsl_base -labsl_spinlock_wait -labsl_int128 -labsl_throw_delegate -labsl_time_zone -labsl_bad_variant_access -labsl_raw_logging_internal -labsl_log_severity

Which (a) removes the header-only libraries, and (b) saves you some whack-a-moling (maybe).

@nealrichardson
Copy link
Member Author

If you have a VM with abseil installed, you could do something like:

pkg-config --libs absl_memory absl_strings absl_str_format absl_time absl_variant absl_base absl_memory absl_optional absl_span absl_time absl_variant
-L/usr/local/lib64 -labsl_str_format_internal -labsl_bad_optional_access -labsl_time -labsl_civil_time -labsl_strings -labsl_strings_internal -lrt -labsl_base -labsl_spinlock_wait -labsl_int128 -labsl_throw_delegate -labsl_time_zone -labsl_bad_variant_access -labsl_raw_logging_internal -labsl_log_severity

Which (a) removes the header-only libraries, and (b) saves you some whack-a-moling (maybe).

Yeah I keep thinking I've reached the end (just one more library!) but that would be smarter. I have a local build of abseil already (from the arrow build), didn't think to point pkg-config at it. Thanks for the suggestion!

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit homebrew-r-autobrew

@github-actions
Copy link

Revision: 4d789f0

Submitted crossbow builds: ursacomputing/crossbow @ actions-f35855420a

Task Status
homebrew-r-autobrew Github Actions

@nealrichardson
Copy link
Member Author

For the R Windows packages, I need to use a static libcurl, and stack overflow tells me that I need to define CURL_STATICLIB for the google-cloud-cpp build. (The thing I just pushed won't work, it's not going to put it in the right place.) I don't see a way I can pass that in to the build macro from the outside (as in, from the cmake command that builds arrow), and I don't think we'd want to hard-code that. Is it possible in cmake to tell if find_package(curl) found a static or shared library? Then I think I could conditionally add that only when appropriate. @kou thoughts?

cc @jeroen in case you have different ideas

@jeroen
Copy link
Contributor

jeroen commented Jun 22, 2022

I think find_package() is supposed to shell out to pkg-config --cflags libcurl and pkg-config --libs libcurl which should automatically include the appropriate macros and static libraries. For example with rtools40 I see:

$ pkg-config --cflags libcurl
-DCURL_STATICLIB -IC:/rtools40/ucrt64/include

$ pkg-config --libs libcurl
-LC:/rtools40/ucrt64/lib -LD:/a/_temp/msys/msys64/ucrt64/lib -lcurl -lnormaliz -lssh2 -lcrypt32 -lgdi32 -lws2_32 -lssl -lcrypto -lws2_32 -lgdi32 -lcrypt32 -lz -lssl -lcrypto -lssl -lcrypto -lws2_32 -lgdi32 -lcrypt32 -lgdi32 -lcrypt32 -lwldap32 -lz -lws2_32

@wjones127
Copy link
Member

(I created docs follow up: https://issues.apache.org/jira/browse/ARROW-16887)

r/configure.win Outdated
Comment on lines 108 to 112
if [ $(cmake_option ARROW_GCS) -eq 1 ]; then
PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_GCS -DCURL_STATICLIB"
fi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should be able to let pkg-config handle adding the flags and libraries for libcurl:

Suggested change
if [ $(cmake_option ARROW_GCS) -eq 1 ]; then
PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_GCS -DCURL_STATICLIB"
fi
if [ $(cmake_option ARROW_GCS) -eq 1 ]; then
PKG_CFLAGS="$PKG_CFLAGS -DARROW_R_WITH_GCS"
PKG_CONFIG_PACKAGES="$PKG_CONFIG_PACKAGES libcurl"
fi

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nevermind this might not help if libcurl is already statically linked in the arrow binaries.

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit -g r

@github-actions
Copy link

Revision: ee4e8e6

Submitted crossbow builds: ursacomputing/crossbow @ actions-fdfd53e5a1

Task Status
conda-linux-gcc-py37-cpu-r40 Azure
conda-linux-gcc-py37-cpu-r41 Azure
conda-osx-clang-py37-r40 Azure
conda-osx-clang-py37-r41 Azure
conda-win-vs2017-py37-r40 Azure
conda-win-vs2017-py37-r41 Azure
homebrew-r-autobrew Github Actions
homebrew-r-brew Github Actions
test-fedora-r-clang-sanitizer Azure
test-r-arrow-backwards-compatibility Github Actions
test-r-depsource-bundled Azure
test-r-depsource-system Github Actions
test-r-dev-duckdb Github Actions
test-r-devdocs Github Actions
test-r-gcc-11 Github Actions
test-r-gcc-12 Github Actions
test-r-install-local Github Actions
test-r-linux-as-cran Github Actions
test-r-linux-rchk Github Actions
test-r-linux-valgrind Azure
test-r-minimal-build Azure
test-r-offline-maximal Github Actions
test-r-offline-minimal Azure
test-r-rhub-debian-gcc-devel-lto-latest Azure
test-r-rhub-debian-gcc-release-custom-ccache Azure
test-r-rhub-ubuntu-gcc-release-latest Azure
test-r-rocker-r-base-latest Azure
test-r-rstudio-r-base-4.1-opensuse153 Azure
test-r-rstudio-r-base-4.2-centos7-devtoolset-8 Azure
test-r-rstudio-r-base-4.2-focal Azure
test-r-ubuntu-22.04 Github Actions
test-r-versions Github Actions
test-ubuntu-18.04-r-sanitizer Azure

@kou
Copy link
Member

kou commented Jun 25, 2022

MinGW 64 failure is something about boost in compiling the GCS filesystem tests: https://github.com/apache/arrow/runs/7054579528?check_suite_focus=true#step:7:1310

We need a preparation to use boost/process.hpp with Mingw-w64 like https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3_test_util.cc#L24-L38 . We can handle it in a follow-up task.

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit test-r-offline-maximal test-r-depsource-system homebrew-r-brew

@github-actions
Copy link

Revision: 226c31b

Submitted crossbow builds: ursacomputing/crossbow @ actions-5668d3fe2c

Task Status
homebrew-r-brew Github Actions
test-r-depsource-system Github Actions
test-r-offline-maximal Github Actions

@nealrichardson
Copy link
Member Author

MinGW 64 failure is something about boost in compiling the GCS filesystem tests: https://github.com/apache/arrow/runs/7054579528?check_suite_focus=true#step:7:1310

We need a preparation to use boost/process.hpp with Mingw-w64 like https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3_test_util.cc#L24-L38 . We can handle it in a follow-up task.

I made ARROW-16906 for that and added a TODO comment in the workflow.

@nealrichardson
Copy link
Member Author

@github-actions crossbow submit homebrew-r-brew

@github-actions
Copy link

Revision: a2a5e87

Submitted crossbow builds: ursacomputing/crossbow @ actions-06d32e61d7

Task Status
homebrew-r-brew Github Actions

@nealrichardson
Copy link
Member Author

Ok @kou we've resolved all of the builds now, this is ready if you want to give it a final review/+1.

@nealrichardson
Copy link
Member Author

One observation: the fully bundled Linux build (R / rhub/debian-gcc-devel:latest) seems to be 10-15 minutes slower with ARROW_GCS=ON. We may want to reconsider turning ARROW_GCS=ON without an explicit request to turn it on in the Linux source build (we can include it in the binaries), but that raises some other UX questions. cc @jonkeane

Copy link
Member

@kou kou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

add_library(google-cloud-cpp::storage STATIC IMPORTED)
set_target_properties(google-cloud-cpp::storage
PROPERTIES IMPORTED_LOCATION
"${GOOGLE_CLOUD_CPP_STATIC_LIBRARY_STORAGE}"
INTERFACE_INCLUDE_DIRECTORIES
"${GOOGLE_CLOUD_CPP_INCLUDE_DIR}")
# Update this from https://github.com/googleapis/google-cloud-cpp/blob/main/google/cloud/storage/google_cloud_cpp_storage.cmake
set_property(TARGET google-cloud-cpp::storage
PROPERTY INTERFACE_LINK_LIBRARIES
google-cloud-cpp::common
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can remove this because google-cloud-cpp-rest-internal depends on google-cloud-cpp::common.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess so, but since it's copied from upstream, I'd rather leave it there for completeness unless it's harming anything.

cpp/cmake_modules/ThirdpartyToolchain.cmake Outdated Show resolved Hide resolved

arrow_built_with() {
# Function to check cmake options for features
grep -i 'set('"$1"' "ON")' $ARROW_OPTS_CMAKE >/dev/null 2>&1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: We can use grep -q ... instead of grep ... >/dev/null 2>&1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grep -q doesn't suppress stderr though, and I want this to be completely silent

Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
@nealrichardson nealrichardson merged commit 3ac0959 into apache:master Jun 26, 2022
@nealrichardson nealrichardson deleted the r-gcs branch June 26, 2022 13:43
djnavarro pushed a commit to djnavarro/arrow that referenced this pull request Jun 28, 2022
This adds basic bindings for GcsFileSystem to R, turns it on in the macOS, Windows, and Linux packaging (same handling as ARROW_S3), and basic R tests.

Followups: 

- Bindings for FromImpersonatedServiceAccount (ARROW-16885)
- Set up testbench for fuller tests, like how we do with minio (ARROW-16879)
- GcsFileSystem::Make should return Result (ARROW-16884)
- Explore auth integration/compatibility with `gargle`, `googleAuthR`, etc.: can we pick up the same credentials they use (ARROW-16880)
- macOS binary packaging: push dependencies upstream (ARROW-16883)
- Windows binary packaging: push dependencies upstream (ARROW-16878)
- Update cloud/filesystem documentation (ARROW-16887)

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Jul 5, 2022
This adds basic bindings for GcsFileSystem to R, turns it on in the macOS, Windows, and Linux packaging (same handling as ARROW_S3), and basic R tests.

Followups: 

- Bindings for FromImpersonatedServiceAccount (ARROW-16885)
- Set up testbench for fuller tests, like how we do with minio (ARROW-16879)
- GcsFileSystem::Make should return Result (ARROW-16884)
- Explore auth integration/compatibility with `gargle`, `googleAuthR`, etc.: can we pick up the same credentials they use (ARROW-16880)
- macOS binary packaging: push dependencies upstream (ARROW-16883)
- Windows binary packaging: push dependencies upstream (ARROW-16878)
- Update cloud/filesystem documentation (ARROW-16887)

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@clear-code.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants