[R-package] Loading binary dataset does not consider the creation non-default params #4904

Open
Tracked by #5153
OfekShilon opened this issue Dec 22, 2021 · 7 comments

@OfekShilon

Description

When we create a dataset with non-default parameters, save it to a binary file, and load it back, calling construct() on the loaded dataset fails.

Reproducible example

library(lightgbm)
nn <- 1e5
xx <- cbind(x=seq.int(nn)%%5, y=seq.int(nn)%%17)
yy <- sin(seq.int(nn))

## lgb.dataset.params <- list(bin_construct_sample_cnt=floor(nn/2), max_bin=16)
lgb.dataset.params <- list(bin_construct_sample_cnt=floor(nn/2))

ref.data <- lgb.Dataset(data=xx, label=yy, params=lgb.dataset.params)
cat("#constructing ref.data with bin_construct_sample_cnt params, [SUCCEEDS]\n")
ref.data$construct()
ref.data.fn <- tempfile()
ref.data$save_binary(ref.data.fn)
ref.data2 <- lgb.Dataset(ref.data.fn)    # In real life scenarios it is hard to pass here the original construction params, and they're available in the file anyway.
ref.data2$construct()

Gives error:

[LightGBM] [Fatal] Dataset bin_construct_sample_cnt 50000 != config 200000
Error in ref.data2$construct() : 
  Dataset bin_construct_sample_cnt 50000 != config 200000

Environment info

> sessionInfo()
R version 4.0.5
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /opt/R-4.0.5.mkl2020/lib64/R/lib/libRblas.so
LAPACK: /opt/R-4.0.5.mkl2020/lib64/R/lib/libRlapack.so

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 
 
locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C         LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C          
 [8] LC_NAME=C            LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C 

attached base packages:
[1] datasets  utils     stats     graphics  grDevices methods   base     

other attached packages:
 [1] lightgbm_3.3.1       RcppRedis_0.1.11     stringi_1.7.6        fst_0.9.4            RsgeMy_0.6.3         istraCinfra_1.0      istrainfraEprice_1.0
 [8] istrainfraFee_1.0    istrainfra8_1.0      istrainfra7_1.0      istrainfra6_1.0      istrainfra5_1.0      istrainfra4_1.0      istrainfra3_1.0     
[15] istrainfra2_1.0      istrainfra1_1.0      istrainfraUtils_1.0  istrainfra_1.0       istratests_1.0       xparam_1.0           bong_1.0            
[22] R6_2.5.1             bit64_4.0.6          bit_4.0.4            magrittr_2.0.1       XML_3.99-0.8         data.table_1.14.3    hwriter_1.3.2       
[29] rjson_0.2.20         plyr_1.8.6           bitops_1.0-7         snow_0.4-4           numbers_0.8-2        digest_0.6.29        RMySQL_0.10.22      
[36] DBI_1.1.1            MASS_7.3-54          ggplot2_3.3.5        crayon_1.4.2        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7          lattice_0.20-45     assertthat_0.2.1    utf8_1.2.2          pillar_1.6.4        rlang_0.4.12        Matrix_1.3-4       
 [8] munsell_0.5.0       tinytex_0.35        compiler_4.0.5      xfun_0.28           pkgconfig_2.0.3     tidyselect_1.1.1    tibble_3.1.6       
[15] codetools_0.2-18    fansi_0.5.0         dplyr_1.0.7         withr_2.4.3         grid_4.0.5          jsonlite_1.7.2      gtable_0.3.0       
[22] lifecycle_1.0.1     scales_1.1.1        ellipsis_0.3.2      generics_0.1.1      vctrs_0.3.8         RApiSerialize_0.1.0 tools_4.0.5        
[29] glue_1.5.1          purrr_0.3.4         parallel_4.0.5      colorspace_2.0-2   

LightGBM version or commit hash:

> packageVersion("lightgbm")
[1] '3.3.1'
@OfekShilon OfekShilon changed the title Loading binary dataset does not consider the creation non-default params [R package] Loading binary dataset does not consider the creation non-default params Dec 22, 2021
@jameslamb
Collaborator

Thanks for using {lightgbm} and for the excellent write-up!

Short Answer

Calling $construct() on a Dataset created from a binary file created with Dataset$save_binary() is unnecessary. Could you tell me more about what you are trying to get {lightgbm} to do by calling $construct() on that loaded file?

Longer Answer

LightGBM doesn't train on your raw data (e.g. matrix or data.frame) directly. Instead, it uses raw training data to create a Dataset object, which holds the results of a bunch of preprocessing, including but not limited to:

  • bucketing continuous features into histograms (and encoding those features' values as just the values of the bin boundaries)
  • bundling sparse features
  • dropping features that are unsplittable based on the parameters you've provided

When you create one of these objects in-memory with lightgbm::lgb.Dataset(), the parameter free_raw_data controls whether or not {lightgbm} keeps around a copy of the raw data. Per the docs, it defaults to TRUE, which means the raw data is not kept. If you've set free_raw_data=FALSE during creation, you can still do things like change parameters after the lgb.Dataset() call.

However, once {lightgbm} no longer knows about the raw data and only has a constructed Dataset, it isn't possible to change things. Some of the processing, like bucketing features into histograms, is non-reversible.
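To make the non-reversibility concrete, here is a minimal base-R sketch. It uses quantile-based boundaries and findInterval() as a stand-in, which is not LightGBM's actual bin-finding algorithm, and the names raw, edges, and bins are illustrative only.

```r
# Illustrative only: quantile binning as a stand-in for LightGBM's bin finding.
set.seed(42)
raw <- rnorm(1000L)

# Choose the boundaries of 4 bins from the raw values.
edges <- quantile(raw, probs = c(0.25, 0.5, 0.75), names = FALSE)

# Replace each raw value with the index (0 to 3) of the bin it falls into.
bins <- findInterval(raw, edges)

# Many distinct raw values collapse into the same bin index,
# so the raw values cannot be recovered from `bins` alone.
length(unique(raw))
length(unique(bins))
```

Re-binning with different boundaries would need `raw` again, which is the situation a constructed Dataset is in once the raw data is gone.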

This is what's happening in the example code you've provided. The file stored with Dataset$save_binary() has already been "constructed", and calling $construct() on it again isn't necessary.

@jameslamb jameslamb changed the title [R package] Loading binary dataset does not consider the creation non-default params [R-package] Loading binary dataset does not consider the creation non-default params Dec 22, 2021
@OfekShilon
Author

@jameslamb Thank you for the prompt and informative response! I'm afraid we may have failed to explain the issue properly. Our problem isn't with construct itself; it just serves as an example. Also, the example is in R only because that's what we use, and I believe we could reproduce the same thing with the CLI. Essentially the scenario is:
(1) Create a saved binary dataset,
(2) Run training using the saved binary dataset, but without passing the lgb.Dataset params that were used to create the original dataset.

Here's an example without construct:

reproduce.lgb.Dataset.bin.save.bug.v2 <- function() {
  library(lightgbm)
  set.seed(5)
  nn <- 1e5
  xx <- cbind(x=rnorm(nn))
  yy <- xx + rnorm(nn)

  lgb.dataset.params <- list(bin_construct_sample_cnt=floor(nn/2))

  data1 <- lgb.Dataset(data=xx, label=yy, params=lgb.dataset.params)
  cat("#constructing data1 with bin_construct_sample_cnt params, [SUCCEEDS]\n")
  data1.fn <- tempfile()
  data1$save_binary(data1.fn)

  cat("#Learning with loaded dataset fails, [FAILS]\n")
  data2 <- lgb.Dataset(data1.fn)
  try(mdl <- lgb.train(data=data2, obj="regression"))
 
  cat("#Learning with loaded dataset but with passing lgb.dataset.params of original dataset succeeds, [SUCCEEDS]\n")
  data3 <- lgb.Dataset(data1.fn, params=lgb.dataset.params)
  mdl2 <- lgb.train(data=data3, obj="regression")
}

Are we still missing something?

@jameslamb jameslamb added the bug label Jan 3, 2022
@jameslamb
Collaborator

ahhh ok, thanks for clarifying @OfekShilon ! And for providing some sample code.

Based on that, I've reproduced this with the following simplified example and tested it against the latest commit on master (af5b40e).

minimal reproducible example
library(lightgbm)

DATASET_FILE <- tempfile(fileext = ".bin")

X <- matrix(
    rnorm(n = 10000L)
    , nrow = 1000L
)
y <- rnorm(n = 1000L)

dtrain <- lgb.Dataset(
    data = X
    , label = y
    , params = list(
        max_bin = 123L
    )
)
dtrain$save_binary(DATASET_FILE)

dtrain_from_file <- lgb.Dataset(DATASET_FILE)
bst <- lgb.train(
    params = list(
        objective = "regression"
    )
    , data = dtrain_from_file
)

[LightGBM] [Fatal] Dataset was constructed with parameter max_bin=123. It cannot be changed to 255 when loading from binary file.
Error in data$construct() :
Dataset was constructed with parameter max_bin=123. It cannot be changed to 255 when loading from binary file.

This validation was added in #3592. Based on the conversation in #3577, I think the intention for this validation was to catch the case where you've explicitly provided Dataset parameters that are inconsistent with the saved file. I think that the scenario of "I created a Dataset using non-default parameter values, saved it to binary, and then want to re-use it without changing the parameter values" wasn't considered, and as a result #3592 broke support for that scenario.

My summary of the conversation here so far is as follows:

| params when creating initial Dataset | params when using Dataset stored in binary file | current behavior | desired behavior |
|---|---|---|---|
| {"max_bin": 255} | {} | no error | no error |
| {} | {"max_bin": 255} | no error | no error |
| {"max_bin": 255} | {"max_bin": 255} | no error | no error |
| {} | {} | no error | no error |
| {} | {"max_bin": 123} | ERROR | ERROR |
| {"max_bin": 190} | {"max_bin": 123} | ERROR | ERROR |
| {"max_bin": 123} | {"max_bin": 123} | no error | no error |
| {"max_bin": 123} | {} | ERROR | no error |

What you can do right now

For now, I recommend storing parameters alongside the Dataset file. I'm sorry for the inconvenience, but hopefully that will unblock you until this pattern is supported in a future version of LightGBM.

example code
library(jsonlite)
library(lightgbm)

PARAMS_FILE <- "lgb-model-params.json"
DATASET_FILE <- "lgb-model.bin"

X <- matrix(
    rnorm(n = 10000L)
    , nrow = 1000
)
y <- rnorm(n = 1000L)

dataset_params <- list(max_bin = 123)

dtrain <- lgb.Dataset(
    data = X
    , label = y
    , params = dataset_params
)

# save Dataset and params
dtrain$save_binary(DATASET_FILE)
jsonlite::write_json(
    x = dataset_params
    , path = PARAMS_FILE
)

# load Dataset and params from files
dtrain_from_file <- lgb.Dataset(
    data = DATASET_FILE
    , params = jsonlite::read_json(
        path = PARAMS_FILE
        , simplifyVector = TRUE
    )
)
bst <- lgb.train(
    params = list(
        objective = "regression"
    )
    , data = dtrain_from_file
)

Questions for maintainers / contributors

Throughout LightGBM, "params" isn't used to represent "all of the configuration for LightGBM" but more like "overrides to default configuration values". I think the intention of #3592 was to raise an error when users pass a value through params which is different from the configuration used when creating the Dataset.

If users pass an empty params when using lgb.Dataset() on a binary file, I believe that should be interpreted as:

"use the configuration values stored in the Dataset file", not "try to update the Dataset loaded from this file to use the configuration defaults from config.h"

@StrikerRUS @shiyu1994 @guolinke @cyfdecyf Do you agree with that interpretation, and would you support a PR that changes LightGBM's behavior in this situation to match that expectation? Thanks!

@StrikerRUS
Collaborator

@jameslamb Thanks a lot for the recap! Yes, I agree with your interpretation.

@cyfdecyf
Contributor

cyfdecyf commented Jan 6, 2022

Looking at the changes in #3592, there are other parameters that are checked. To support @jameslamb's interpretation, we need to find a general way to mark whether a parameter was actually specified by the user or is just the default.

Is there any parameter that is stored in the binary dataset and also controls the behavior of training? If so, a binary dataset's non-default parameter would change the logic of training, and I'm not sure whether that would be considered expected. (I suspect use_missing and zero_as_missing, but I'm not sure, since I didn't dig deep into the training logic.)
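One possible way to keep track of that distinction, sketched in base R: resolve the configuration from defaults plus overrides, but also remember which names the caller passed. This only illustrates the idea and is not how LightGBM's Config is implemented. DEFAULTS, make_config(), and is_explicit() are hypothetical names, and the default values are the documented ones for these four parameters.

```r
# Hypothetical sketch: remember which parameters the caller actually passed.
DEFAULTS <- list(
    max_bin = 255L
    , bin_construct_sample_cnt = 200000L
    , use_missing = TRUE
    , zero_as_missing = FALSE
)

make_config <- function(user_params = list()) {
    list(
        # effective values: defaults, overridden by anything the user passed
        values = utils::modifyList(DEFAULTS, user_params)
        # provenance: names of the parameters the user passed explicitly
        , explicit = names(user_params)
    )
}

is_explicit <- function(config, name) {
    name %in% config$explicit
}

cfg <- make_config(list(max_bin = 255L))

# passed by the caller, even though the value equals the default
is_explicit(cfg, "max_bin")

# filled in from DEFAULTS, never passed by the caller
is_explicit(cfg, "use_missing")
```

With something like this, a check against the binary file could be limited to the parameters in `explicit`, and everything else could be taken from the file.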

@OfekShilon
Copy link
Author

A year has passed. Is this fixed by #5424? When can we expect a fix?

@jameslamb
Collaborator

Thanks for using LightGBM @OfekShilon . This has not been fixed yet.

"When can we expect a fix?"

As far as I know, no one is actively working on this right now. LightGBM's small team of maintainers is currently focused on the remaining work necessary for the next major release of the project (#5153). So I cannot give you an estimated date when this will be fixed.

This is an open source project and we'd welcome contribution if you or anyone else reading this thread would like to submit a pull request with a proposed fix.
