Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[R-package] v4.0.0 CRAN submission issues #5987

Closed
jameslamb opened this issue Jul 18, 2023 · 24 comments · Fixed by #5988
Closed

[R-package] v4.0.0 CRAN submission issues #5987

jameslamb opened this issue Jul 18, 2023 · 24 comments · Fixed by #5988

Comments

@jameslamb
Copy link
Collaborator

jameslamb commented Jul 18, 2023

Description

The recent submission of {lightgbm} v4.0.0 to CRAN was rejected because of the following issues detected by R CMD check.

* checking examples ... [8s/4s] NOTE
Examples with CPU time > 2.5 times elapsed time
                    user system elapsed ratio
lgb.restore_handle 1.819  0.128   0.165  11.8

* checking S3 generic/method consistency ... NOTE
Mismatches for apparent methods not registered:
merge:
  function(x, y, ...)
merge.eval.string:
  function(env)

format:
  function(x, ...)
format.eval.string:
  function(eval_res, eval_err)
See section 'Registering S3 methods' in the 'Writing R Extensions'
manual.

These need to be fixed for CRAN to accept our submission.

Reproducible example

These check failures happened during the CRAN incoming checks.

Environment info

LightGBM version or commit hash: v4.0.0.

Additional Comments

I will put up a pull request shortly to fix these things. We can figure out later how to catch them in CI so they don't surprise us on CRAN submissions in the future.

The examples timing issue is because I did not complete #5102 as part of this release, sorry 😞 . I said I would almost a year ago (#5102 (comment)), just forgot about it.

@jameslamb
Copy link
Collaborator Author

Re-opening this. Sadly our v4.0.0.1 submission still did not resolve the issue about using more than 2 threads in examples and tests, and was not accepted to CRAN.

@jameslamb jameslamb reopened this Jul 21, 2023
@jameslamb jameslamb mentioned this issue Jul 21, 2023
19 tasks
@mayer79
Copy link
Contributor

mayer79 commented Jul 24, 2023

Regarding the timing: in the worst case, pack the firing examples in a dontrun block:

@examples
#' \dontrun{
#' do some lgb magic
#' }

@jameslamb
Copy link
Collaborator Author

jameslamb commented Jul 24, 2023

Thanks for the suggestion @mayer79 ! But I don't think that's the fix we should pursue.

I don't think we can safely assume it's only the one example that happened to be reported in CRAN's initial checks. The timings are unpredictable and depend on, for example, what else happens to be running on the CRAN check machines at the same time as they're checking {lightgbm}.

In addition...I no longer trust \dontrun or \donttest mechanisms as a way to avoid issues in the CRAN checks.

We used to wrap all examples in \dontrun (see #3270)... until CRAN manual review by a human told us we had to drop that in favor of \donttest (#3338).

... and then with R 4.0, R CMD check started running all the \donttest{} examples anyway 🙃

See https://cran.r-project.org/doc/manuals/r-devel/NEWS.html

*R CMD check --as-cran now runs \donttest examples

We have to find a more permanent solution to ensure that everything multi-threaded in this project uses at most 2 threads in tests and examples. It's frustrating that v4.0.0 of the project is being blocked from CRAN because of an example that takes less than 2 seconds to run but... we can only work on the things we can control.

@mayer79
Copy link
Contributor

mayer79 commented Jul 24, 2023

Ha, darn. I recently had the same problem: Setting all num_threads in all examples in https://github.com/ModelOriented/shapviz would not bring down those example runtimes. Then I just packed them into {dontrun}.

The issue about merge.* is already solved, thx!

@david-cortes
Copy link
Contributor

Regarding the issue with the threads on lgb.restore_handle - given that bug #4705 hasn't been addressed, I guess it could be solved by passing num_threads=0, or by a few calls to RhpcBLASctl (which is already under Suggests).

I also don't think that'd be grounds for rejection (the one about S3 mismatch would though) since, when checks fail due to large thread usage, it typically manifests as a segmentation fault on the solaris checks (or at least it does when passing the numer of threads on individual pragmas, which this library doesn't).

@jameslamb
Copy link
Collaborator Author

I also don't think that'd be grounds for rejection

That is not correct. Our second attempted submission was rejected by the automatic CRAN checks, and the only issue was that NOTE about the example timings.

@jameslamb
Copy link
Collaborator Author

There has been significant discussion about this recently on the R-pkg-devel mailing list. Many package authors have been facing CRAN's newly-strict insistence on projects only using 2 threads in all places.

I'm still reading through all of them, so won't attempt to summarize yet... just linking these to make it clear this is not a LightGBM-specific issue.

@jameslamb
Copy link
Collaborator Author

Additionally, see this recent comment on an old {data.table} issue where one maintainer of a package on CRAN reported being told that OMP_THREAD_LIMIT environment variable is no longer set on CRAN:

Rdatatable/data.table#3300 (comment)

@david-cortes
Copy link
Contributor

Note that those discussions from data.table are around violations of the CRAN policy that are caused by calling functions and methods from the data.table package (which are multi-threaded by default) in examples and tests of packages that have data.table as a dependency.

The issue here is not coming from usage of data.table in the lightgbm package, but from direct usage of OpenMP pragmas with the global config instead of using a local num_threads clause.

@jameslamb
Copy link
Collaborator Author

jameslamb commented Sep 4, 2023

That's not entirely true. That conversation has also continued the r-pkg-devel mailing list conversation about:

@trivialfis
Copy link

XGBoost release is running into similar issues with system time constraints: dmlc/xgboost#9497 (comment)

We actually test this https://github.com/dmlc/xgboost/blob/f90d034a86784f4f07417d1d28a77b4a189acc89/tests/ci_build/test_r_package.py#L109 , but I haven't been able to reproduce the error.

@simonpcouch
Copy link

Appreciate the tag on the bundle issue! Just wanted to drop our approach to skipping long-running or dependency-heavy examples conditionally here in case it's useful for yall. In some of our packages, we define a helper that tells us whether we ought to run long-running or dependency-heavy examples, used like so. The is_cran_check() definition assumes that we check via devtools::check() rather than via R CMD check, and would need adaptation if that's not yall's workflow.

Definitely not ideal, though somewhat more resilient to CRAN check environments than \dontrun and \donttest.

@jameslamb
Copy link
Collaborator Author

I'm actively working on this. Just sharing my notes as I go through it. Before putting up potential fixes, I'm first trying to find a way to reproduce the timing-related warnings.

On my macbook pro (1 Dual-Core Intel Core i5 CPU), I gave Docker for Mac access to 4 CPUs then ran the following from the root of this repo.

docker run \
    --rm \
    --cpus 4 \
    --env MAKEFLAGS=-j2 \
    -v $(pwd):/opt/LightGBM \
    -w /opt/LightGBM \
    -it rocker/verse:4.3.1 \
    /bin/bash

Installed OpenMP in there.

apt-get update
apt-get install -y \
    libomp-dev

And the one Suggests dependency of {lightgbm} not already available in that image.

Rscript \
    --vanilla \
    -e "install.packages(c('RhpcBLASctl'), repos = 'https://cran.r-project.org')"

Then built a CRAN-style source distribution of the R package, and moved it to a new directory (not on the path that's mounted, so reads and writes are faster).

mkdir -p /opt/r-check
sh build-cran-package.sh --no-build-vignettes
cp ./lightgbm_4.1.0.99.tar.gz /opt/r-check/
cd /opt/r-check

Approach 1: R CMD check

Then ran R CMD check with some customizations to force it to print all the timings.

OMP_NUM_THREADS=2 \
_R_CHECK_EXAMPLE_TIMING_THRESHOLD_=0 \
_R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_=2.5 \
R --vanilla CMD check \
    --no-codoc \
    --no-manual \
    --no-tests \
    --no-vignettes \
    --run-dontrun \
    --run-donttest \
    --timings \
    ./lightgbm_4.1.0.99.tar.gz

For a description of what those flags mean, see dmlc/xgboost#9497 (comment).

results with `OMP_NUM_THREADS=1` (click me)
readRDS.lgb.Booster         0.531  0.015   0.405
lgb.plot.interpretation     0.438  0.003   0.410
lgb.interprete              0.285  0.002   0.255
lgb.cv                      0.270  0.005   0.174
lgb.model.dt.tree           0.246  0.005   0.193
lgb.importance              0.211  0.002   0.165
lgb.load                    0.154  0.010   0.105
saveRDS.lgb.Booster         0.145  0.001   0.093
lgb.train                   0.143  0.002   0.107
predict.lgb.Booster         0.133  0.003   0.084
lgb.plot.importance         0.128  0.000   0.097
lgb.save                    0.117  0.008   0.080
lgb.restore_handle          0.111  0.002   0.081
lgb.configure_fast_predict  0.100  0.001   0.064
lgb.get.eval.result         0.100  0.000   0.067
lgb.dump                    0.090  0.000   0.056
set_field                   0.087  0.000   0.080
lgb.Dataset.create.valid    0.053  0.026   0.079
get_field                   0.066  0.003   0.069
lgb.Dataset                 0.053  0.010   0.064
dimnames.lgb.Dataset        0.056  0.003   0.059
slice                       0.055  0.000   0.054
dim                         0.049  0.004   0.053
lgb.Dataset.set.categorical 0.049  0.001   0.049
lgb.Dataset.set.reference   0.048  0.000   0.049
lgb.Dataset.save            0.046  0.001   0.047
lgb.Dataset.construct       0.040  0.004   0.043
lgb.convert_with_rules      0.020  0.000   0.020
results with `OMP_NUM_THREADS=2` (click me)
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     0.442  0.003   0.329
lgb.interprete              0.322  0.002   0.215
lgb.cv                      0.192  0.004   0.124
lgb.importance              0.191  0.001   0.134
lgb.model.dt.tree           0.172  0.001   0.126
lgb.train                   0.138  0.004   0.107
lgb.plot.importance         0.125  0.000   0.090
lgb.load                    0.100  0.008   0.066
saveRDS.lgb.Booster         0.106  0.002   0.063
predict.lgb.Booster         0.104  0.002   0.065
lgb.Dataset.create.valid    0.071  0.033   0.078
dim                         0.096  0.007   0.102
readRDS.lgb.Booster         0.100  0.002   0.064
lgb.restore_handle          0.091  0.004   0.064
lgb.configure_fast_predict  0.088  0.001   0.054
lgb.dump                    0.086  0.003   0.056
lgb.save                    0.087  0.002   0.057
lgb.get.eval.result         0.082  0.004   0.055
lgb.Dataset                 0.071  0.010   0.052
slice                       0.078  0.001   0.048
dimnames.lgb.Dataset        0.074  0.004   0.067
get_field                   0.074  0.003   0.055
set_field                   0.075  0.001   0.050
lgb.Dataset.set.categorical 0.059  0.003   0.040
lgb.Dataset.save            0.059  0.001   0.040
lgb.Dataset.construct       0.057  0.001   0.038
lgb.Dataset.set.reference   0.046  0.001   0.033
lgb.convert_with_rules      0.033  0.002   0.022
results with `OMP_NUM_THREADS=3` (click me)
                             user system elapsed
lgb.plot.interpretation     0.463  0.000   0.331
lgb.interprete              0.356  0.004   0.234
lgb.cv                      0.215  0.002   0.138
lgb.importance              0.209  0.002   0.151
lgb.model.dt.tree           0.179  0.000   0.134
lgb.Dataset.create.valid    0.115  0.032   0.086
lgb.plot.importance         0.125  0.000   0.092
lgb.train                   0.118  0.003   0.089
readRDS.lgb.Booster         0.118  0.003   0.066
lgb.load                    0.109  0.010   0.072
lgb.Dataset                 0.110  0.007   0.052
slice                       0.115  0.000   0.052
saveRDS.lgb.Booster         0.101  0.000   0.062
lgb.configure_fast_predict  0.099  0.001   0.054
predict.lgb.Booster         0.097  0.002   0.058
lgb.get.eval.result         0.096  0.002   0.062
get_field                   0.095  0.001   0.053
set_field                   0.095  0.000   0.048
lgb.dump                    0.090  0.004   0.058
lgb.save                    0.089  0.005   0.059
lgb.restore_handle          0.083  0.002   0.058
lgb.Dataset.save            0.076  0.004   0.040
lgb.Dataset.set.categorical 0.076  0.003   0.040
lgb.Dataset.construct       0.071  0.001   0.038
dimnames.lgb.Dataset        0.065  0.003   0.045
lgb.Dataset.set.reference   0.057  0.002   0.034
dim                         0.044  0.002   0.046
lgb.convert_with_rules      0.043  0.003   0.021
results with `OMP_NUM_THREADS=4` (click me)
* checking examples ... OK
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     0.456  0.004   0.320
lgb.interprete              0.359  0.000   0.223
lgb.cv                      0.213  0.000   0.136
dimnames.lgb.Dataset        0.211  0.000   0.135
lgb.importance              0.196  0.000   0.140
lgb.Dataset.create.valid    0.186  0.002   0.101
lgb.model.dt.tree           0.181  0.001   0.131
readRDS.lgb.Booster         0.160  0.006   0.074
lgb.Dataset                 0.156  0.000   0.056
slice                       0.146  0.005   0.054
lgb.plot.importance         0.139  0.001   0.102
get_field                   0.139  0.000   0.064
dim                         0.138  0.000   0.139
lgb.configure_fast_predict  0.122  0.004   0.067
lgb.train                   0.119  0.002   0.088
set_field                   0.121  0.000   0.060
lgb.load                    0.108  0.005   0.069
saveRDS.lgb.Booster         0.102  0.000   0.063
predict.lgb.Booster         0.100  0.000   0.063
lgb.Dataset.set.categorical 0.097  0.001   0.039
lgb.Dataset.construct       0.095  0.000   0.039
lgb.dump                    0.094  0.000   0.058
lgb.save                    0.088  0.003   0.056
lgb.Dataset.set.reference   0.085  0.004   0.037
lgb.restore_handle          0.088  0.001   0.061
lgb.get.eval.result         0.086  0.000   0.054
lgb.Dataset.save            0.079  0.004   0.036
lgb.convert_with_rules      0.056  0.006   0.024

(CPU time being around 2x elapsed time is expected for many of these examples, since we hardcoded num_threads = 2 into a bunch of the examples in #5988 in an earlier attempt to fix this)

Approach 2: directly running the examples

Following @trivialfis 's excellent work from dmlc/xgboost#9497 (comment), I tried just installing the package and running the examples directly.

output of doing that (click me)
R --vanilla CMD INSTALL ./lightgbm_4.1.0.99.tar.gz

cd /opt/LightGBM

cat << EOF > check-examples.R
library(pkgload)
library(lightgbm)

files <- list.files(
    "./R-package/man"
    , pattern="*.Rd"
    , full.names = TRUE
)

run_example_timeit <- function(f) {
  print(paste("Test", f))
  flush.console()
  t0 <- proc.time()
  capture.output({
       pkgload::run_example(f, run_donttest = TRUE, quiet = TRUE)
   })
  example_timing = proc.time() - t0
  print(example_timing)
}

timings <- lapply(files, run_example_timeit)
EOF

OMP_NUM_THREADS=2 \
Rscript --vanilla ./check-examples.R

That printed the following:

[1] "Test ./R-package/man/agaricus.test.Rd"
   user  system elapsed
  0.018   0.004   0.035
[1] "Test ./R-package/man/agaricus.train.Rd"
   user  system elapsed
  0.002   0.005   0.012
[1] "Test ./R-package/man/bank.Rd"
   user  system elapsed
  0.002   0.002   0.007
[1] "Test ./R-package/man/dim.Rd"
   user  system elapsed
  0.048   0.005   0.056
[1] "Test ./R-package/man/dimnames.lgb.Dataset.Rd"
   user  system elapsed
  0.049   0.005   0.049
[1] "Test ./R-package/man/get_field.Rd"
   user  system elapsed
  0.091   0.002   0.065
[1] "Test ./R-package/man/lgb_shared_dataset_params.Rd"
   user  system elapsed
  0.009   0.001   0.007
[1] "Test ./R-package/man/lgb_shared_params.Rd"
   user  system elapsed
  0.009   0.002   0.007
[1] "Test ./R-package/man/lgb.configure_fast_predict.Rd"
   user  system elapsed
  0.114   0.006   0.088
[1] "Test ./R-package/man/lgb.convert_with_rules.Rd"
   user  system elapsed
  0.042   0.003   0.041
[1] "Test ./R-package/man/lgb.cv.Rd"
   user  system elapsed
  0.183   0.003   0.122
[1] "Test ./R-package/man/lgb.Dataset.construct.Rd"
   user  system elapsed
  0.068   0.001   0.051
[1] "Test ./R-package/man/lgb.Dataset.create.valid.Rd"
   user  system elapsed
  0.090   0.034   0.098
[1] "Test ./R-package/man/lgb.Dataset.Rd"
   user  system elapsed
  0.077   0.006   0.059
[1] "Test ./R-package/man/lgb.Dataset.save.Rd"
   user  system elapsed
  0.069   0.001   0.051
[1] "Test ./R-package/man/lgb.Dataset.set.categorical.Rd"
   user  system elapsed
  0.067   0.003   0.051
[1] "Test ./R-package/man/lgb.Dataset.set.reference.Rd"
   user  system elapsed
  0.056   0.001   0.047
[1] "Test ./R-package/man/lgb.drop_serialized.Rd"
   user  system elapsed
  0.002   0.001   0.007
[1] "Test ./R-package/man/lgb.dump.Rd"
   user  system elapsed
  0.089   0.002   0.073
[1] "Test ./R-package/man/lgb.get.eval.result.Rd"
   user  system elapsed
  0.094   0.002   0.066
[1] "Test ./R-package/man/lgb.importance.Rd"
   user  system elapsed
  0.221   0.003   0.175
[1] "Test ./R-package/man/lgb.interprete.Rd"
   user  system elapsed
  0.337   0.004   0.236
[1] "Test ./R-package/man/lgb.load.Rd"
   user  system elapsed
  0.108   0.009   0.076
[1] "Test ./R-package/man/lgb.make_serializable.Rd"
   user  system elapsed
  0.019   0.002   0.014
[1] "Test ./R-package/man/lgb.model.dt.tree.Rd"
   user  system elapsed
  0.164   0.003   0.132
[1] "Test ./R-package/man/lgb.plot.importance.Rd"
   user  system elapsed
  0.184   0.001   0.145
[1] "Test ./R-package/man/lgb.plot.interpretation.Rd"
   user  system elapsed
  0.405   0.003   0.312
[1] "Test ./R-package/man/lgb.restore_handle.Rd"
   user  system elapsed
  0.103   0.003   0.075
[1] "Test ./R-package/man/lgb.save.Rd"
   user  system elapsed
  0.124   0.002   0.092
[1] "Test ./R-package/man/lgb.train.Rd"
   user  system elapsed
  0.104   0.002   0.074
[1] "Test ./R-package/man/lightgbm.Rd"
   user  system elapsed
  0.011   0.000   0.007
[1] "Test ./R-package/man/predict.lgb.Booster.Rd"
   user  system elapsed
  0.100   0.001   0.072
[1] "Test ./R-package/man/print.lgb.Booster.Rd"
   user  system elapsed
  0.019   0.001   0.013
[1] "Test ./R-package/man/readRDS.lgb.Booster.Rd"
   user  system elapsed
  0.110   0.001   0.081
[1] "Test ./R-package/man/saveRDS.lgb.Booster.Rd"
   user  system elapsed
  0.116   0.004   0.081
[1] "Test ./R-package/man/set_field.Rd"
   user  system elapsed
  0.086   0.000   0.064
[1] "Test ./R-package/man/slice.Rd"
   user  system elapsed
  0.101   0.000   0.066
[1] "Test ./R-package/man/summary.lgb.Booster.Rd"
   user  system elapsed
  0.004   0.002   0.010

Other things I checked

No OpenMP environment variables werre set in the container.

env | grep -i omp
# (empty)

All 4 logical CPUs I was giving to Docker appeared to be visible to the R process.

Rscript \
    --vanilla \
    -e "print(RhpcBLASctl::omp_get_max_threads())"
# [1] 4

Next steps

When I return to this in the next few days, I'll try repeating these approaches on a machine with more physical CPUs. I'm hoping with something like a 12-core or 24-core machine, the sources of unintended parallelism will really show up... and then we'll be able to test fixes.

@jameslamb
Copy link
Collaborator Author

jameslamb commented Oct 8, 2023

I repeated steps similar to the ones above tonight on a VM from AWS with the following specs:

  • instance type: c5a.4xlarge
  • vCPUs: 16
  • RAM: 32GB
  • disk storage (attached EBS volume): 32GB
  • operating system: Ubuntu 22.04

And installed the following on it:

  • R: 4.3.1
  • clang: 14.0.0
  • gcc: 11.4.0
  • libomp-dev: 14.0
setup commands (click me)
sudo apt-get update
sudo apt-get install --no-install-recommends -y \
    software-properties-common

sudo apt-get install --no-install-recommends -y \
    apt-utils \
    build-essential \
    ca-certificates \
    cmake \
    curl \
    git \
    iputils-ping \
    jq \
    libcurl4 \
    libicu-dev \
    libomp-dev \
    libssl-dev \
    libunwind8 \
    locales \
    locales-all \
    netcat \
    unzip \
    zip

export LANG="en_US.UTF-8"
sudo update-locale LANG=${LANG}
export LC_ALL="${LANG}"

# set up R environment
export CRAN_MIRROR="https://cran.rstudio.com"
export MAKEFLAGS=-j8
export R_LIB_PATH=~/Rlib
export R_LIBS=$R_LIB_PATH
export PATH="$R_LIB_PATH/R/bin:$PATH"
export R_APT_REPO="jammy-cran40/"
export R_LINUX_VERSION="4.3.1-1.2204.0"

mkdir -p $R_LIB_PATH

mkdir -p ~/.gnupg
echo "disable-ipv6" >> ~/.gnupg/dirmngr.conf
sudo apt-key adv \
    --homedir ~/.gnupg \
    --keyserver keyserver.ubuntu.com \
    --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9

sudo add-apt-repository \
    "deb ${CRAN_MIRROR}/bin/linux/ubuntu ${R_APT_REPO}"

sudo apt-get update
sudo apt-get install \
    --no-install-recommends \
    -y \
        autoconf \
        automake \
        clang \
        devscripts \
        r-base-core=${R_LINUX_VERSION} \
        r-base-dev=${R_LINUX_VERSION} \
        texinfo \
        texlive-latex-extra \
        texlive-latex-recommended \
        texlive-fonts-recommended \
        texlive-fonts-extra \
        tidy \
        qpdf

# install dependencies
Rscript \
    --vanilla \
    -e "install.packages(c('data.table', 'jsonlite', 'knitr', 'Matrix', 'R6', 'RhpcBLASctl', 'rmarkdown', 'testthat'), repos = '${CRAN_MIRROR}', lib = '${R_LIB_PATH}', dependencies = c('Depends', 'Imports', 'LinkingTo'), Ncpus = parallel::detectCores())"

# build LightGBM
mkdir -p ${HOME}/repos
cd ${HOME}/repos
git clone --recursive https://github.com/microsoft/LightGBM.git
cd ./LightGBM
sh build-cran-package.sh --no-build-vignettes

With gcc, I wasn't able to replicate the warnings from CRAN.

OMP_NUM_THREADS=16 \
_R_CHECK_EXAMPLE_TIMING_THRESHOLD_=0 \
_R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_=2.5 \
R --vanilla CMD check \
    --no-codoc \
    --no-manual \
    --no-tests \
    --no-vignettes \
    --run-dontrun \
    --run-donttest \
    --timings \
    ./lightgbm_4.1.0.99.tar.gz
results (click me)
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     0.477  0.000   0.251
lgb.interprete              0.404  0.009   0.176
lgb.Dataset.create.valid    0.284  0.024   0.060
slice                       0.300  0.000   0.045
lgb.Dataset                 0.247  0.005   0.043
readRDS.lgb.Booster         0.190  0.011   0.062
set_field                   0.197  0.000   0.046
lgb.importance              0.170  0.023   0.125
lgb.Dataset.set.categorical 0.191  0.001   0.035
get_field                   0.189  0.001   0.046
lgb.Dataset.save            0.175  0.006   0.035
lgb.cv                      0.168  0.002   0.106
lgb.configure_fast_predict  0.162  0.000   0.051
lgb.model.dt.tree           0.140  0.000   0.109
lgb.Dataset.construct       0.131  0.004   0.035
lgb.Dataset.set.reference   0.124  0.005   0.034
saveRDS.lgb.Booster         0.121  0.000   0.091
lgb.plot.importance         0.120  0.000   0.083
lgb.convert_with_rules      0.112  0.000   0.016
dimnames.lgb.Dataset        0.087  0.015   0.040
lgb.load                    0.097  0.004   0.060
predict.lgb.Booster         0.096  0.000   0.055
lgb.restore_handle          0.078  0.000   0.059
lgb.dump                    0.076  0.000   0.053
lgb.train                   0.075  0.000   0.052
lgb.get.eval.result         0.073  0.000   0.052
lgb.save                    0.073  0.000   0.053
dim                         0.034  0.012   0.046

I noticed that CRAN was using clang 16.0.0 on the check where they rejected {lightgbm} v4.0.0 for example timings

** libs
using C++ compiler: ‘Debian clang version 16.0.6 (3)’

(CRAN install logs).

So I switched to clang...

cat << EOF > ${HOME}/.R/Makevars
CC=clang
CXX=clang++
CXX17=clang++
EOF

... and tried again.

OMP_NUM_THREADS=16 \
_R_CHECK_EXAMPLE_TIMING_THRESHOLD_=0 \
_R_CHECK_EXAMPLE_TIMING_CPU_TO_ELAPSED_THRESHOLD_=2.5 \
R --vanilla CMD check \
    --no-codoc \
    --no-manual \
    --no-tests \
    --no-vignettes \
    --run-dontrun \
    --run-donttest \
    --timings \
    ./lightgbm_4.1.0.99.tar.gz

That didn't reproduce the specific WARNING triggered on CRAN, about the lgb.restore_handle() example... but it did expose several places where CPU time was more than 2.5x elapsed time (the type of thing CRAN is warning about)!!!

Examples with CPU time > 2.5 times elapsed time
                          user system elapsed  ratio
readRDS.lgb.Booster      1.065  0.030   0.068 16.103
lgb.Dataset.create.valid 0.992  0.055   0.066 15.864
saveRDS.lgb.Booster      1.558  0.044   0.101 15.861
lgb.cv                   2.886  0.107   0.221 13.543
lgb.interprete           3.259  0.084   0.306 10.925
lgb.plot.interpretation  3.373  0.074   0.340 10.138
full results (click me)
Examples with CPU (user + system) or elapsed time > 0s
                             user system elapsed
lgb.plot.interpretation     3.373  0.074   0.340
lgb.interprete              3.259  0.084   0.306
lgb.cv                      2.886  0.107   0.221
saveRDS.lgb.Booster         1.558  0.044   0.101
readRDS.lgb.Booster         1.065  0.030   0.068
lgb.Dataset.create.valid    0.992  0.055   0.066
lgb.configure_fast_predict  0.938  0.024   0.060
get_field                   0.764  0.016   0.055
slice                       0.737  0.032   0.048
set_field                   0.695  0.016   0.049
lgb.Dataset                 0.675  0.013   0.045
lgb.Dataset.save            0.574  0.025   0.038
lgb.Dataset.set.categorical 0.572  0.023   0.037
lgb.Dataset.set.reference   0.570  0.024   0.038
lgb.Dataset.construct       0.584  0.008   0.037
dimnames.lgb.Dataset        0.291  0.011   0.059
lgb.convert_with_rules      0.290  0.008   0.018
lgb.importance              0.277  0.007   0.124
lgb.model.dt.tree           0.216  0.000   0.108
lgb.plot.importance         0.161  0.023   0.082
predict.lgb.Booster         0.134  0.005   0.052
lgb.get.eval.result         0.100  0.004   0.052
lgb.save                    0.095  0.008   0.052
lgb.dump                    0.093  0.008   0.051
lgb.train                   0.097  0.004   0.050
lgb.load                    0.083  0.012   0.058
lgb.restore_handle          0.080  0.000   0.059
dim                         0.046  0.000   0.046
Examples with CPU time > 2.5 times elapsed time
                          user system elapsed  ratio
readRDS.lgb.Booster      1.065  0.030   0.068 16.103
lgb.Dataset.create.valid 0.992  0.055   0.066 15.864
saveRDS.lgb.Booster      1.558  0.044   0.101 15.861
lgb.cv                   2.886  0.107   0.221 13.543
lgb.interprete           3.259  0.084   0.306 10.925
lgb.plot.interpretation  3.373  0.074   0.340 10.138

Now that we have a way to reproduce this, I'll start working through changes to address that.

I'm hoping that a mix of the following will be enough:

Getting closer 😁

@jameslamb
Copy link
Collaborator Author

Still experimenting with this, just want to add one happy finding... the issues appear to completely be about LightGBM's own use of multi-threading + possibly how it configures {data.table}.

Using data.table::setDTthreads(1) everywhere in R code and setting OMP_NUM_THREADS() in LightGBM to just always be 1, I got timings like the following with clang:

                             user system elapsed
lgb.plot.interpretation     0.246  0.000   0.245
lgb.interprete              0.171  0.000   0.173
lgb.importance              0.121  0.004   0.124
lgb.model.dt.tree           0.112  0.000   0.112
lgb.cv                      0.107  0.000   0.107
saveRDS.lgb.Booster         0.093  0.000   0.093
lgb.plot.importance         0.084  0.000   0.084
lgb.Dataset.create.valid    0.038  0.039   0.079
lgb.load                    0.056  0.012   0.068
readRDS.lgb.Booster         0.062  0.004   0.066
lgb.restore_handle          0.061  0.004   0.065
predict.lgb.Booster         0.060  0.000   0.060
lgb.save                    0.053  0.004   0.057
lgb.dump                    0.053  0.003   0.056
lgb.train                   0.056  0.000   0.056
lgb.get.eval.result         0.047  0.008   0.054
dim                         0.047  0.004   0.056
get_field                   0.047  0.004   0.051
lgb.Dataset                 0.041  0.010   0.052
lgb.configure_fast_predict  0.050  0.000   0.051
set_field                   0.050  0.000   0.050
slice                       0.050  0.000   0.049
dimnames.lgb.Dataset        0.046  0.000   0.046
lgb.Dataset.save            0.042  0.001   0.042
lgb.Dataset.construct       0.040  0.002   0.041
lgb.Dataset.set.categorical 0.038  0.004   0.042
lgb.Dataset.set.reference   0.036  0.000   0.035
lgb.convert_with_rules      0.016  0.000   0.017

That's a very good sign. It means that this issue isn't about some other source of multi-threading from LightGBM's dependencies, such as Eigen: https://eigen.tuxfamily.org/dox/TopicMultiThreading.html.

Will share more updates soon!

@jameslamb
Copy link
Collaborator Author

Status update for those subscribed to this issue:

  1. CRAN has threatened to archive {lightgbm} if we don't fix a new warning from clang by December 12: [R-package] Warnings of CRAN Package #6221
  2. I've put up [R-package] [c++] add tighter multithreading control, avoid global OpenMP side effects (fixes #4705, fixes #5102) #6226 to attempt to fix the multithreading issues that led to CRAN rejecting v4.0.0 ([R-package] v4.0.0 CRAN submission issues #5987)
  3. If that PR is accepted, we'll try to release the current state of master to CRAN as v4.2.0 in the next few days
  4. If not, I'll put up a patch that limits {lightgbm} to only using at most 2 threads everywhere, unconditionally and then try to release the current state of master to CRAN as v4.2.0

@simonpcouch I know the tidymodels team was waiting for for v4.x of {lightgbm} to hit CRAN (rstudio/bundle#55 (comment))... do you have any capacity to help us with reverse-dependency checks?

I'm planning to try checking reverse dependencies using a package built from the branch on #6226 over the next few days, but would welcome any help your team could provide if you have the capacity.

@simonpcouch
Copy link

We do have capacity to run reverse dependency checks! I'll start up some checks based on #6226 today.

@jameslamb
Copy link
Collaborator Author

Thank you so much!

@simonpcouch
Copy link

I see no new problems with #6226! Checked with revdepcheck's revdep_check() by passing a tarball built from the branch.

Note for CRAN:

## revdepcheck results

We checked 11 reverse dependencies, comparing R CMD check results across CRAN and dev versions of this package.

 * We saw 0 new problems
 * We failed to check 0 packages

@jameslamb
Copy link
Collaborator Author

Thank you SO MUCH for that! It's really helpful.

I thought for sure we'd have issue, especially since a few new packages have taken on {lightgbm} as a dependency since the last time I did revdep checks in anticipation of LightGBM 4.0.

Glad that that's not the case 😁

@simonpcouch
Copy link

Congrats on the CRAN release!

@jameslamb
Copy link
Collaborator Author

thanks again for your help @simonpcouch !

@jameslamb
Copy link
Collaborator Author

v4.2.0 of the R package has been accepted to CRAN and has passed all of their main checks: #6191 (comment)

This issue can be closed.

Thank you so much to everyone who helped with this!!!!

@jameslamb jameslamb unpinned this issue Jan 13, 2024
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 25, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants