-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[R-package] Loading binary dataset does not consider the creation non-default params #4904
Comments
Thanks for using Short Answer Calling Longer Answer LightGBM doesn't train on your raw data (e.g. matrix or
When you create one of these objects in-memory with However...once This is what's happening in the example code you've provided. The file stored with |
@jameslamb Thank you for the prompt and informative response! I'm afraid we may have failed to properly explain the issue. Our problem isn't with Here's an example without reproduce.lgb.Dataset.bin.save.bug.v2 <- function() {
library(lightgbm)
set.seed(5)
nn <- 1e5
xx <- cbind(x=rnorm(nn))
yy <- xx + rnorm(nn)
lgb.dataset.params <- list(bin_construct_sample_cnt=floor(nn/2))
data1 <- lgb.Dataset(data=xx, label=yy, params=lgb.dataset.params)
cat("#constructing data1 with bin_construct_sample_cnt params, [SUCCEEDS]\n")
data1.fn <- tempfile()
data1$save_binary(data1.fn)
cat("#Learning with loaded dataset fails, [FAILS]\n")
data2 <- lgb.Dataset(data1.fn)
try(mdl <- lgb.train(data=data2, obj="regression"))
cat("#Learning with loaded dataset but with passing lgb.dataset.params of original dataset succeeds, [SUCCEEDS]\n")
data3 <- lgb.Dataset(data1.fn, params=lgb.dataset.params)
mdl2 <- lgb.train(data=data3, obj="regression")
} Are we still missing something? |
ahhh ok, thanks for clarifying @OfekShilon ! And for providing some sample code. Based on that, I've reproduced this with the following simplified example and tested it against the latest commit on minimal reproducible example (click me)library(lightgbm)
DATASET_FILE <- tempfile(fileext = ".bin")
X <- matrix(
rnorm(n = 10000L)
, nrow = 1000L
)
y <- rnorm(n = 1000L)
dtrain <- lgb.Dataset(
data = X
, label = y
, params = list(
bin_construct_sample_cnt = 1234
)
)
dtrain$save_binary(DATASET_FILE)
dtrain_from_file <- lgb.Dataset(DATASET_FILE)
bst <- lgb.train(
params = list(
objective = "regression"
)
, data = dtrain_from_file
)
This validation was added in #3592. Based on the conversation in #3577, I think the intention for this validation was to catch the case where you've explicitly provided Dataset parameters that are inconsistent with the saved file. I think that the scenario of "I created a Dataset using non-default parameter values, saved it to binary, and then want to re-use it without changing the parameter values" wasn't considered, and as a result #3592 broke support for that scenario. My summary of the conversation here so far is as follows:
What you can do right nowFor now, I recommend storing parameters alongside the Dataset file. I'm sorry for the inconvenience, but hopefully that will unblock you until this pattern is supported in a future version of LightGBM. example code (click me)library(jsonlite)
library(lightgbm)
PARAMS_FILE <- "lgb-model-params.json"
DATASET_FILE <- "lgb-model.bin"
X <- matrix(
rnorm(n = 10000L)
, nrow = 1000
)
y <- rnorm(n = 1000L)
dataset_params <- list(max_bin = 123)
dtrain <- lgb.Dataset(
data = X
, label = y
, params = dataset_params
)
# save Dataset and params
dtrain$save_binary(DATASET_FILE)
jsonlite::write_json(
x = dataset_params
, path = PARAMS_FILE
)
# load Dataset and params from files
dtrain_from_file <- lgb.Dataset(
data = DATASET_FILE
, params = jsonlite::read_json(
path = PARAMS_FILE
, simplifyVector = TRUE
)
)
bst <- lgb.train(
params = list(
objective = "regression"
)
, data = dtrain_from_file
) Questions for maintainers / contributorsThroughout LightGBM, "params" isn't used to represent "all of the configuration for LightGBM" but more like "overrides to default configuration values". I think the intention of #3592 was to raise an error when users pass a value through If users pass an empty
@StrikerRUS @shiyu1994 @guolinke @cyfdecyf Do you agree with that interpretation, and would you support a PR that changes LightGBM's behavior in this situation to match that expectation? Thanks! |
@jameslamb Thanks a lot for the recap! Yes, I agree with your interpretation. |
Looking at the changes in #3592, there are other parameters that's checked. To support @jameslamb's interpretation, we need to find a general way to mark whether a parameter is actually specified by user or it's just the default. Is there any parameter that's not only stored in the binary dataset which also controls the behavior for training? If this is the case, then binary dataset's non default paramter would change the logic of training, I'm not sure if this would be considered as expected. (I'm suspecting |
A year had passed.. Is this fixed by #5424 ? When can we expect a fix? |
Thanks for using LightGBM @OfekShilon . This has not been fixed yet.
As far as I know, no one is actively working on this right now. LightGBM's small team of maintainers is currently focused on the remaining work necessary for the next major release of the project (#5153). So I cannot give you an estimated date when this will be fixed. This is an open source project and we'd welcome contribution if you or anyone else reading this thread would like to submit a pull request with a proposed fix. |
Description
When we create a dataset with non default parameters, save it and load it -
construct()
breaks.Reproducible example
Gives error:
Environment info
LightGBM version or commit hash:
The text was updated successfully, but these errors were encountered: