GH-35140: [R] Rewrite configure script and ensure we don't use mismatched libarrow (#35147)

I've significantly rewritten `r/configure` to make it easier to reason about and harder for issues like #34229 and #35140 to happen. I've also added a version check so that we don't try to use a system C++ library whose version doesn't match the R package version. Making sure this check was applied in all of the right places, and deciding what to do when the versions didn't match, was the impetus for the whole refactor.

`configure` has been broken up into functions, and the flow of the script, as documented at the top of the file, is:

```
# * Find libarrow on the system. If it is present, make sure
#   that its version is compatible with the R package.
# * If no suitable libarrow is found, download it (where allowed)
#   or build it from source.
# * Determine what features this libarrow has and what other
#   flags it requires, and set them in src/Makevars for use when
#   compiling the bindings.
# * Run a test program to confirm that arrow headers are found
```
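The decision at the heart of that flow can be sketched as a small shell function. This is only an illustration: `choose_libarrow_source` and its arguments are hypothetical names, not functions from the real `r/configure`, and "compatible" is reduced here to plain string equality.

```sh
#!/bin/sh
# Hypothetical sketch of the control flow; these names are NOT taken from
# the real r/configure.

# Decide where libarrow should come from.
#   $1: version of the libarrow found on the system ("" if none was found)
#   $2: version of the R package
#   $3: "true" if downloading a prebuilt binary is allowed
choose_libarrow_source() {
  found_version="$1"
  r_version="$2"
  download_allowed="$3"

  if [ -n "$found_version" ] && [ "$found_version" = "$r_version" ]; then
    # A suitable system libarrow exists: use it.
    echo "system"
  elif [ "$download_allowed" = "true" ]; then
    # Nothing suitable found, but we may fetch a prebuilt binary.
    echo "download"
  else
    # Last resort: build libarrow from source.
    echo "build-from-source"
  fi
}
```

For example, `choose_libarrow_source "11.0.0" "12.0.0" "true"` prints `download`, because the system library is rejected for not matching the package version.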

All of the detection of CFLAGS, `-L` dirs, etc. now happens in one place, and it prefers using `pkg-config` to read from the libarrow build what libraries and flags it requires, rather than hard-coding them. (autobrew is the only remaining exception, but I didn't feel like messing with that today.) This should make the builds more future-proof and make more build configurations work: e.g. I suspect that a static build in ARROW_HOME wouldn't have gotten picked up correctly because it didn't add `-larrow_bundled_dependencies` to the libs, but now it will. It may also eliminate the redundant `-l` and `-D` settings I've observed in some builds (not harmful but definitely sloppy).
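As a rough sketch of what "ask `pkg-config` instead of hard-coding" looks like in shell: the function below uses only standard `pkg-config` options (`--exists`, `--cflags`, `--libs`). The function name and the `PKG_CONFIG_CMD` override are assumptions made for this example so that it can be exercised without a real libarrow installed; they are not part of `r/configure`.

```sh
#!/bin/sh
# Hypothetical helper, not the real r/configure code.
# PKG_CONFIG_CMD is an assumed override for testing; it defaults to the
# real pkg-config binary.
arrow_pkg_flags() {
  pc="${PKG_CONFIG_CMD:-pkg-config}"

  # Fail quietly if pkg-config is missing or does not know about arrow.
  "$pc" --exists arrow 2>/dev/null || return 1

  # Let the libarrow build tell us what it needs.
  echo "PKG_CFLAGS=$("$pc" --cflags arrow)"
  echo "PKG_LIBS=$("$pc" --libs arrow)"
}
```

For a static libarrow, `pkg-config --libs --static arrow` is the standard way to also request private link dependencies; whether that output includes `-larrow_bundled_dependencies` depends on how the library's `.pc` file was generated.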

Version checking is implemented in an R script, `check-versions.R`, for ease of testing (and for easier handling of arithmetic), with an accompanying `test-check-versions.R`. These are run on all the builds that use `ci/scripts/r_test.sh`.
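Because `check-versions.R` reads the R package version as its first argument and the C++ version as its second, and calls `stop()` on a mismatch, a shell caller can rely on the exit status of `Rscript`. The wrapper below is a sketch under those assumptions: the function name, the `RSCRIPT_CMD` override, and the relative path are illustrative, and the real `configure` may invoke the script differently.

```sh
#!/bin/sh
# Hypothetical wrapper, assuming the current directory is r/.
# RSCRIPT_CMD is an assumed override for testing; it defaults to Rscript.
#   $1: R package version
#   $2: libarrow (C++) version
libarrow_version_ok() {
  rscript="${RSCRIPT_CMD:-Rscript}"
  # Exits non-zero when check_versions() calls stop("version mismatch").
  "$rscript" tools/check-versions.R "$1" "$2"
}
```

A caller could then write `if libarrow_version_ok "$r_version" "$cpp_version"; then ...` and fall back to a bundled build in the `else` branch.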

### Behavior changes

* If libarrow is found on the system (via ARROW_HOME, pkg-config, or brew), but the version does not match, it will not be used, and we will try a bundled build. This should mean that users installing a released version will never have libarrow version problems. 
* If both the found C++ library and the R package are on matching dev versions (the strings are never identical, because R uses x.y.z.9000 while C++ uses x+1.y.z-SNAPSHOT), `configure` will proceed with a warning that you may need to rebuild if there are issues. This means that regular developers will see an extra message in the build output.
* autobrew is only used on a release version unless you set FORCE_AUTOBREW=true. This eliminates another source of version mismatches (C++ release version, R dev version).
* The path where you could set `LIB_DIR` and `INCLUDE_DIR` env vars has been removed. Use `ARROW_HOME` instead.
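The matching rule in the first two bullets can be restated in shell. This is an illustrative re-implementation of the logic in `r/tools/check-versions.R`, not code from this commit; `versions_compatible` and its output strings are made up for the example, and it assumes well-formed, numeric version components.

```sh
#!/bin/sh
# Hypothetical shell restatement of the rule in r/tools/check-versions.R.
#   $1: R package version (e.g. 10.0.0 or 10.0.0.9000)
#   $2: C++ library version (e.g. 10.0.0 or 11.0.0-SNAPSHOT)
# Prints "match", "dev-match" or "mismatch"; returns non-zero on mismatch.
versions_compatible() {
  r_version="$1"
  cpp_version="$2"

  # Release versions must be identical.
  if [ "$r_version" = "$cpp_version" ]; then
    echo "match"
    return 0
  fi

  # R marks dev versions with a fourth component such as 9000.
  r_dev="$(printf '%s\n' "$r_version" | cut -d. -f4)"
  if [ -n "$r_dev" ] && [ "$r_dev" -gt 100 ]; then
    r_is_dev=true
  else
    r_is_dev=false
  fi

  # C++ marks dev versions with a -SNAPSHOT suffix.
  case "$cpp_version" in
    *-SNAPSHOT) cpp_is_dev=true ;;
    *) cpp_is_dev=false ;;
  esac

  # R dev is current.release.9000, C++ dev is next.release-SNAPSHOT,
  # so a dev "match" means R major + 1 equals C++ major.
  r_major="${r_version%%.*}"
  cpp_major="${cpp_version%%.*}"
  if $r_is_dev && $cpp_is_dev && [ $((r_major + 1)) -eq "$cpp_major" ]; then
    echo "dev-match"
    return 0
  fi

  echo "mismatch"
  return 1
}
```

With the cases from `test-check-versions.R`, `versions_compatible 10.0.0.9000 11.0.0-SNAPSHOT` prints `dev-match`, while `versions_compatible 10.0.0.9000 10.0.0-SNAPSHOT` prints `mismatch`.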

* Closes: #35140
* Closes: #31989

Lead-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
nealrichardson and kou authored May 3, 2023
1 parent f794b20 commit ec89360
Showing 13 changed files with 502 additions and 275 deletions.
4 changes: 2 additions & 2 deletions dev/tasks/conda-recipes/r-arrow/meta.yaml
@@ -59,8 +59,8 @@ requirements:

test:
commands:
- $R -e "library('arrow')" # [not win]
- "\"%R%\" -e \"library('arrow'); data(mtcars); write_parquet(mtcars, 'test.parquet')\"" # [win]
- $R -e "library('arrow'); stopifnot(arrow_with_acero(), arrow_with_dataset(), arrow_with_parquet(), arrow_with_s3())" # [not win]
- "\"%R%\" -e \"library('arrow'); stopifnot(arrow_with_acero(), arrow_with_dataset(), arrow_with_parquet(), arrow_with_s3())\"" # [win]

about:
home: https://github.com/apache/arrow
1 change: 1 addition & 0 deletions dev/tasks/r/github.macos.autobrew.yml
@@ -71,6 +71,7 @@ jobs:
NOT_CRAN: true
ARROW_USE_PKG_CONFIG: false
ARROW_R_DEV: true
FORCE_AUTOBREW: true
{{ macros.github_set_sccache_envvars()|indent(8)}}
run: arrow/ci/scripts/r_test.sh arrow
- name: Dump install logs
3 changes: 2 additions & 1 deletion dev/tasks/r/github.packages.yml
@@ -186,7 +186,8 @@ jobs:
shell: Rscript {0}
env:
NOT_CRAN: "true" # actions/setup-r sets this implicitly
ARROW_R_DEV: TRUE
ARROW_R_DEV: "true"
FORCE_AUTOBREW: "true" # this is ignored on windows
# sccache for macos
{{ macros.github_set_sccache_envvars()|indent(8) }}
run: |
2 changes: 1 addition & 1 deletion r/Makefile
@@ -31,7 +31,7 @@ doc: style
-git add --all man/*.Rd

test:
export ARROW_R_DEV=$(ARROW_R_DEV) && R CMD INSTALL --install-tests --no-test-load --no-docs --no-help --no-byte-compile .
export NOT_CRAN=true && export ARROW_R_DEV=$(ARROW_R_DEV) && R CMD INSTALL --install-tests --no-test-load --no-docs --no-help --no-byte-compile .
export NOT_CRAN=true && export ARROW_R_DEV=$(ARROW_R_DEV) && export AWS_EC2_METADATA_DISABLED=TRUE && export ARROW_LARGE_MEMORY_TESTS=$(ARROW_LARGE_MEMORY_TESTS) && R -s -e 'library(testthat); setwd(file.path(.libPaths()[1], "arrow", "tests")); system.time(test_check("arrow", filter="${file}", reporter=ifelse(nchar("${r}"), "${r}", "summary")))'

deps:
555 changes: 316 additions & 239 deletions r/configure

Large diffs are not rendered by default.

8 changes: 7 additions & 1 deletion r/inst/build_arrow_static.sh
@@ -36,7 +36,13 @@ set -x
SOURCE_DIR="$(cd "${SOURCE_DIR}" && pwd)"
DEST_DIR="$(mkdir -p "${DEST_DIR}" && cd "${DEST_DIR}" && pwd)"

: ${N_JOBS:="$(nproc)"}
if [ "$N_JOBS" = "" ]; then
if [ "`uname -s`" = "Darwin" ]; then
N_JOBS="$(sysctl -n hw.logicalcpu)"
else
N_JOBS="$(nproc)"
fi
fi

# Make some env vars case-insensitive
if [ "$LIBARROW_MINIMAL" != "" ]; then
59 changes: 59 additions & 0 deletions r/tools/check-versions.R
@@ -0,0 +1,59 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

args <- commandArgs(TRUE)

# TESTING is set in test-check-versions.R; it won't be set when called from configure
test_mode <- exists("TESTING")

check_versions <- function(r_version, cpp_version) {
r_parsed <- package_version(r_version)
r_dev_version <- r_parsed[1, 4]
r_is_dev <- !is.na(r_dev_version) && r_dev_version > 100
cpp_is_dev <- grepl("SNAPSHOT$", cpp_version)
cpp_parsed <- package_version(sub("-SNAPSHOT$", "", cpp_version))

major <- function(x) as.numeric(x[1, 1])
# R and C++ denote dev versions differently
# R is current.release.9000, C++ is next.release-SNAPSHOT
# So a "match" is if the R major version + 1 = C++ major version
if (r_is_dev && cpp_is_dev && major(r_parsed) + 1 == major(cpp_parsed)) {
msg <- c(
sprintf("*** > Packages are both on development versions (%s, %s)", cpp_version, r_version),
"*** > If installation fails, rebuild the C++ library to match the R version",
"*** > or retry with FORCE_BUNDLED_BUILD=true"
)
cat(paste0(msg, "\n", collapse = ""))
} else if (r_version != cpp_version) {
cat(
sprintf(
"**** Not using: C++ library version (%s) does not match R package (%s)\n",
cpp_version,
r_version
)
)
stop("version mismatch")
# Add ALLOW_VERSION_MISMATCH env var to override stop()? (Could be useful for debugging)
} else {
# OK
cat(sprintf("**** C++ and R library versions match: %s\n", cpp_version))
}
}

if (!test_mode) {
check_versions(args[1], args[2])
}
30 changes: 26 additions & 4 deletions r/tools/nixlibs.R
@@ -53,6 +53,17 @@ try_download <- function(from_url, to_file, hush = quietly) {
!inherits(status, "try-error") && status == 0
}

not_cran <- env_is("NOT_CRAN", "true")
if (not_cran) {
# Set more eager defaults
if (env_is("LIBARROW_BINARY", "")) {
Sys.setenv(LIBARROW_BINARY = "true")
}
if (env_is("LIBARROW_MINIMAL", "")) {
Sys.setenv(LIBARROW_MINIMAL = "false")
}
}

# For local debugging, set ARROW_R_DEV=TRUE to make this script print more
quietly <- !env_is("ARROW_R_DEV", "true")

@@ -374,11 +385,14 @@ build_libarrow <- function(src_dir, dst_dir) {
# We'll need to compile R bindings with these libs, so delete any .o files
system("rm src/*.o", ignore.stdout = TRUE, ignore.stderr = TRUE)
# Set up make for parallel building
# CRAN policy says not to use more than 2 cores during checks
# If you have more and want to use more, set MAKEFLAGS or NOT_CRAN
ncores <- parallel::detectCores()
if (!not_cran) {
ncores <- min(ncores, 2)
}
makeflags <- Sys.getenv("MAKEFLAGS")
if (makeflags == "") {
# CRAN policy says not to use more than 2 cores during checks
# If you have more and want to use more, set MAKEFLAGS
ncores <- min(parallel::detectCores(), 2)
makeflags <- sprintf("-j%s", ncores)
Sys.setenv(MAKEFLAGS = makeflags)
}
@@ -416,8 +430,16 @@ build_libarrow <- function(src_dir, dst_dir) {
CC = sub("^.*ccache", "", R_CMD_config("CC")),
CXX = paste(sub("^.*ccache", "", R_CMD_config("CXX17")), R_CMD_config("CXX17STD")),
# CXXFLAGS = R_CMD_config("CXX17FLAGS"), # We don't want the same debug symbols
LDFLAGS = R_CMD_config("LDFLAGS")
LDFLAGS = R_CMD_config("LDFLAGS"),
N_JOBS = ncores
)

dep_source <- Sys.getenv("ARROW_DEPENDENCY_SOURCE")
if (dep_source %in% c("", "AUTO") && !nzchar(Sys.which("pkg-config"))) {
cat("**** pkg-config not installed, setting ARROW_DEPENDENCY_SOURCE=BUNDLED\n")
env_var_list <- c(env_var_list, ARROW_DEPENDENCY_SOURCE = "BUNDLED")
}

env_var_list <- with_cloud_support(env_var_list)

# turn_off_all_optional_features() needs to happen after
62 changes: 62 additions & 0 deletions r/tools/test-check-versions.R
@@ -0,0 +1,62 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.


# Usage: run testthat::test_dir(".") inside of this directory

# Flag so that we just load the functions and don't evaluate them like we do
# when called from configure
TESTING <- TRUE

source("check-versions.R", local = TRUE)

test_that("check_versions", {
expect_output(
check_versions("10.0.0", "10.0.0"),
"**** C++ and R library versions match: 10.0.0",
fixed = TRUE
)
expect_output(
expect_error(
check_versions("10.0.0", "10.0.0-SNAPSHOT"),
"version mismatch"
),
"**** Not using: C++ library version (10.0.0-SNAPSHOT) does not match R package (10.0.0)",
fixed = TRUE
)
expect_output(
expect_error(
check_versions("10.0.0.9000", "10.0.0-SNAPSHOT"),
"version mismatch"
),
"**** Not using: C++ library version (10.0.0-SNAPSHOT) does not match R package (10.0.0.9000)",
fixed = TRUE
)
expect_output(
expect_error(
check_versions("10.0.0.9000", "10.0.0"),
"version mismatch"
),
"**** Not using: C++ library version (10.0.0) does not match R package (10.0.0.9000)",
fixed = TRUE
)
expect_output(
check_versions("10.0.0.9000", "11.0.0-SNAPSHOT"),
"*** > Packages are both on development versions (11.0.0-SNAPSHOT, 10.0.0.9000)\n",
fixed = TRUE
)
})
42 changes: 24 additions & 18 deletions r/vignettes/developers/install_details.Rmd
@@ -35,14 +35,14 @@ handle finding the libarrow, setting up the build variables necessary, and
writing the package Makevars file that is used to compile the C++ code in the R
package.

* `tools/nixlibs.R` - this script is sometimes called by `configure` on Linux
* `tools/nixlibs.R` - this script is called by `configure` on Linux
(or on any non-windows OS with the environment variable
`FORCE_BUNDLED_BUILD=true`) if an existing libarrow installation cannot be found.
This sets up the build process for our bundled builds (which is the default on
linux) and checks for binaries or downloads libarrow from source depending on
dependency availability and build configuration.

* `tools/winlibs.R` - this script is sometimes called by `configure.win` on Windows
* `tools/winlibs.R` - this script is called by `configure.win` on Windows
when environment variable `ARROW_HOME` is not set. It looks for an existing libarrow
installation, and if it can't find one downloads an appropriate libarrow binary.

@@ -74,24 +74,30 @@ If no existing libarrow installations can be found, the script proceeds to try t

### Non-Windows

The diagram below shows how the R package finds a libarrow installation on non-Windows systems.
On Linux and macOS, the core logic is:

```{r, echo=FALSE, out.width="70%", fig.alt = "Flowchart of libarrow installation on non-Windows systems - find full description in sections 'Using pkg-config', 'Prebuilt binaries' and 'Building from source' below"}
knitr::include_graphics("./install_nix.png")
```
1. If `FORCE_BUNDLED_BUILD=true`, skip to step 3.
2. Find libarrow on the system. If it is present, make sure that its version
is compatible with the R package.
3. If no suitable libarrow is found, download it (where allowed) or build it from source.
4. Determine what features this libarrow has and what other flags it requires,
and set them in `src/Makevars` for use when compiling the bindings.

More information about these steps can be found below.
#### Finding libarrow on the system

#### Using pkg-config
The `configure` script will look for libarrow in three places:

When you install the arrow R package on non-Windows systems, if no environment variables
relating to the location of an existing libarrow installation have already by
set, the installation code will attempt to find libarrow on
your system using the `pkg-config` command.
1. The path in environment variable `ARROW_HOME`, if set
2. Whatever `pkg-config` finds, unless `ARROW_USE_PKG_CONFIG=false`
3. Homebrew, if you have done `brew install apache-arrow`

This will find either installed system packages or libraries you've built yourself.
In order for `install.packages("arrow")` to work with these system packages,
you'll need to install them before installing the R package.
If a libarrow build is found, it will then check that the version of that C++ library
matches that of the R package. If the versions do not match, like when you've installed
a system package for a release version but you have a development version of the
R package, that libarrow will not be used. If both the C++ library and R package are
on development versions, you will see a warning message advising you that if you do have
trouble, you should ensure that the C++ library was built from the same commit as the R
package, as development version numbers do not change with every commit.

#### Prebuilt binaries

@@ -108,7 +114,7 @@ downloaded and bundled when your R package compiles.

#### Building from source

If no libarrow binary is found, it will attempt to build it locally.
If no suitable libarrow binary is found, it will attempt to build it locally.
First, it will also look to see if you are in a checkout of the `apache/arrow`
git repository and thus have the libarrow source files there.
Otherwise, it builds from the source files included in the package.
@@ -126,8 +132,8 @@ See the [Arrow project installation page](https://arrow.apache.org/install/)
to find pre-compiled binary packages for some common Linux distributions,
including Debian, Ubuntu, and CentOS.

Generally, we do not recommend this method of working with libarrow with the R
package unless you have a specific reason to do so.
If you are a developer contributing to the R package, system libarrow packages won't
be useful because the versions will not match.

# Using the R package with an existing libarrow build

Binary file removed r/vignettes/developers/install_nix.png
Binary file not shown.
9 changes: 1 addition & 8 deletions r/vignettes/install.Rmd
@@ -520,17 +520,10 @@ so that we can improve the script.

* On CentOS, building the package requires a more modern `devtoolset` than the default system compilers. See "System dependencies" above.

* If you have multiple versions of `zstd` installed on your system,
installation by building libarrow from source may fail with an "undefined symbols"
error. Workarounds include (1) setting `LIBARROW_BINARY` to use a C++ binary; (2)
setting `ARROW_WITH_ZSTD=OFF` to build without `zstd`; or (3) uninstalling
the conflicting `zstd`.
See discussion [here](https://issues.apache.org/jira/browse/ARROW-8556).

## Contributing

We are constantly working to make the installation process as painless as
possible. If you find ways to improve the process, please [report an issue](https://issues.apache.org/jira/projects/ARROW/issues) so that we can
possible. If you find ways to improve the process, please [report an issue](https://github.com/apache/arrow/issues) so that we can
document it. Similarly, if you find that your Linux distribution
or version is not supported, we would welcome the contribution of Docker
images (hosted on Docker Hub) that we can use in our continuous integration
2 changes: 1 addition & 1 deletion r/vignettes/install_nightly.Rmd
@@ -46,7 +46,7 @@ R CMD INSTALL .

If you don't already have libarrow on your system,
when installing the R package from source, it will also download and build
libarrow for you. See the section above on build environment
libarrow for you. See the links below for build environment
variables for options for configuring the build source and enabled features.

## Further reading