[R-package] predict() breaks when using a Dataset stored in a file #4034

j-kreis · 2021-03-01T12:10:02Z

Description

On Windows, R crashes without an error message when calling lgb.Dataset.save.
On Linux I am able to save the dataset, but lgb.predict can not find saved dataset

Reproducible example

For the Windows bug (the example given by lightgbm::lgb.Dataset.save)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
train_file <- tempfile(fileext = ".bin")
lgb.Dataset.save(dtrain, train_file)

For the Linux bug (Example given by lightgbm::lgb.load + predict using a file as input)

data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(dtrain, test$data, label = test$label)

test_file <- file.path(getwd(), "test.bin")
lgb.Dataset.save(dtest, test_file)

params <- list(objective = "regression", metric = "l2")
valids <- list(test = dtest)
model <- lgb.train(params = params, data = dtrain, nrounds = 5L, 
                   valids = valids, learning_rate = 1.0, 
                   early_stopping_rounds = 3L)
model_file <- tempfile(fileext = ".txt")
lgb.save(model, model_file)
load_booster <- lgb.load(filename = model_file)
model_string <- model$save_model_to_string(NULL) # saves best iteration
load_booster_from_str <- lgb.load(model_str = model_string)

model$predict(test_file)

The error:

Error in lgb.call(fun_name = "LGBM_BoosterPredictForFile_R", ret = NULL,  : 
  [LightGBM] [Fatal] Data file ��?��V doesn't exist.

Environment info

LightGBM version or commit hash:

lightgbm_3.1.1

Command(s) you used to install LightGBM

install.packages('lightgbm')


jameslamb · 2021-03-03T23:39:51Z

Thanks very much for using {lightgbm} and for the excellent bug report! It's possible that the Windows part of this is related to another not-yet-solved issue (#4007), but I'm not sure yet.

For the Linux example, could you try changing uses of tempfile() to permanent files like file.path(getwd(), "model.txt") and let me know if that fixes it? That would confirm the problem you're facing is not specific to the use of tempfiles. It would also help if you could provide the specific logs / error messages that you've summarized as "lgb.predict can not find saved dataset".

It will be another day or two before I'm able to look at this in depth, apologies.

j-kreis · 2021-03-04T07:36:37Z

Thanks for the quick response!! The description above now uses a permanent file and shows the error message, which is still there after updating the example.

jameslamb · 2021-03-11T02:21:20Z

I started looking into this tonight. I think the two issues might be unrelated but not sure yet, so it's ok to leave them here as one thing for now.

I was able to reproduce the "Data file doesn't exist" bug on my Mac, with slightly simpler sample code.

library(lightgbm)

# set up training data
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

# set up scoring data
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(
    dataset = dtrain
    , data = test$data
    , label = test$label
)

test_file <- file.path(getwd(), "test.bin")
if (file.exists(test_file)) {
    file.remove(test_file)
}
lgb.Dataset.save(
    dataset = dtest
    , fname = test_file
)

model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
    )
    , data = dtrain
    , nrounds = 5L
    , learning_rate = 1.0
)

model$predict(test_file)

I saw this behavior on {lightgbm} 3.1.1 and on the latest commit of master (8d0669f).

jameslamb · 2021-04-02T04:19:49Z

@ticarki sorry for the delay in getting back to you.

For the Windows half of this issue, I'm confident now that it's the same as #4045. I just submitted a fix for that issue (#4155). I tried your Windows example above on the branch for #4155 and no longer see a crash. Could you please try it out? You can follow the steps at #4045 (comment) to install from that feature branch.

I haven't tested yet if the problem you saw on Linux is related. I suspect that it isn't. So for now, I'm going to change the name of this issue to just describe that problem. Let me know if you disagree with how I've rephrased the title.

jameslamb · 2021-05-03T22:00:03Z

I haven't looked at this again, yet. Some of the recent changes made as part of #3016 MIGHT end up fixing this.

If no one else does it sooner, I'll come back and try to reproduce this after #3016 is complete.

jameslamb · 2021-05-11T23:00:09Z

Ok, I came back to look at this tonight. I think that now, thanks to #4252, the reproducible examples above will produce a more informative error message.

Error in predictor$predict(data = data, start_iteration = start_iteration, :
[LightGBM] [Fatal] Unknown format of training data.

I realize now that the examples are trying to predict on a saved LightGBM Dataset. I don't think that is supported.

As @shiyu1994 said in #4210 (comment):

"Once the model is trained, currently we don't have any support to use the trained model to evaluate a constructed Dataset."

I believe that LGBM_BoosterPredictForFile (the underlying method from LightGBM's C++ library) currently supports only TSV, CSV, and LibSVM formats:

LightGBM/src/application/predictor.hpp

Line 169 in f831808

auto parser = std::unique_ptr<Parser>(Parser::CreateParser(data_filename, header, boosting_->MaxFeatureIdx() + 1, label_idx,

LightGBM/src/io/parser.cpp

Lines 232 to 239 in f831808

Parser* Parser::CreateParser(const char* filename, bool header, int num_features, int label_idx, bool precise_float_parser) {
  const int n_read_line = 32;
  auto lines = ReadKLineFromFile(filename, header, n_read_line);
  int num_col = 0;
  DataType type = GetDataType(filename, header, lines, &num_col);
  if (type == DataType::INVALID) {
    Log::Fatal("Unknown format of training data.");
  }

LightGBM/src/io/parser.cpp

Line 177 in f831808

DataType GetDataType(const char* filename, bool header,

@shiyu1994 am I right about that? If I am, I can update the documentation to clarify the supported file types.

@ticarki if you want to get predictions from a trained model and want to do that on data stored in a file, you'll have to use raw data in one of those formats for now.

Adding this to the end of the code from #4034 (comment) worked for me.

test_csv <- file.path(getwd(), "test.csv")
write.table(
    x = as.matrix(test$data)
    , file = test_csv
    , row.names = FALSE
    , col.names = FALSE
    , sep = ","
)
preds_from_file <- model$predict(test_csv, header = FALSE)
preds_in_mem <- model$predict(test$data)
identical(preds_from_file, preds_in_mem)
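
The same approach should carry over to the other text formats mentioned above. Below is a minimal, self-contained sketch for a tab-separated file. It assumes an installed {lightgbm} version in which model$predict() accepts a file path together with header = FALSE, as in the CSV snippet; the file name test.tsv and the use of tempdir() are illustrative choices, not something from this thread.

```r
library(lightgbm)

# training and scoring data, as in the examples in this thread
data(agaricus.train, package = "lightgbm")
data(agaricus.test, package = "lightgbm")
train <- agaricus.train
test <- agaricus.test

dtrain <- lgb.Dataset(train$data, label = train$label)
model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
        , learning_rate = 1.0
    )
    , data = dtrain
    , nrounds = 5L
)

# write the features only (no label column, no header) as tab-separated text
# NOTE: "test.tsv" in tempdir() is a hypothetical location chosen for this sketch
test_tsv <- file.path(tempdir(), "test.tsv")
write.table(
    x = as.matrix(test$data)
    , file = test_tsv
    , row.names = FALSE
    , col.names = FALSE
    , sep = "\t"
)

# predict from the text file and from the in-memory matrix
preds_from_tsv <- model$predict(test_tsv, header = FALSE)
preds_in_mem <- model$predict(test$data)

# all.equal() tolerates tiny floating point differences from text parsing
all.equal(preds_from_tsv, preds_in_mem)
```

Using all.equal() rather than identical() is a deliberate choice here: values round-tripped through a text file are not guaranteed to be bit-for-bit identical in general, even though they may be for this 0/1 dataset.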

shiyu1994 · 2021-05-18T09:16:23Z

@jameslamb Yes. Currently, a Dataset loaded from a binary file (or the binary file itself) cannot be used as input to the prediction methods. However, the statement you quoted from me in #4210 (comment) is wrong:

"Once the model is trained, currently we don't have any support to use the trained model to evaluate a constructed Dataset."

As @StrikerRUS pointed out in #4210 (comment), we can use the eval method to evaluate a constructed Dataset with a trained Booster.

jameslamb · 2021-08-22T04:24:53Z

In #4545, I've proposed some documentation changes and an error message change to try to make it a bit clearer that only text files are supported in predict().

For anyone finding this issue, you can try the following sample code with the R package to evaluate a constructed Dataset stored in a file.

library(lightgbm)

# set up training data
data(agaricus.train, package = "lightgbm")
train <- agaricus.train
dtrain <- lgb.Dataset(train$data, label = train$label)

# set up scoring data
data(agaricus.test, package = "lightgbm")
test <- agaricus.test
dtest <- lgb.Dataset.create.valid(
    dataset = dtrain
    , data = test$data
    , label = test$label
)
dtest$construct()

test_file <- file.path(getwd(), "test.bin")
if (file.exists(test_file)) {
    file.remove(test_file)
}
lgb.Dataset.save(
    dataset = dtest
    , fname = test_file
)

model <- lgb.train(
    params = list(
        objective = "regression"
        , metric = "l2"
        , learning_rate = 1.0
    )
    , data = dtrain
    , nrounds = 5L
)

# evaluate constructed dataset
model$eval(
    data = lgb.Dataset(
        data = test_file
    )$construct()
    , name = "test_set"
)


github-actions · 2023-08-23T14:21:52Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

StrikerRUS added bug r-package labels Mar 1, 2021

jameslamb mentioned this issue Mar 8, 2021

v3.2.0 release #3872

Merged

jameslamb mentioned this issue Apr 2, 2021

[R-package] prevent symbol lookup conflicts (fixes #4045) #4155

Merged

jameslamb changed the title ~~[R-package] Issues with saving and reading Datasets~~ [R-package] predict() breaks when using a Dataset stored in a file Apr 2, 2021

This was referenced Apr 26, 2021

[R-package] fix warnings in unit tests #4225

Merged

[R-package] FATAL Error when run train() #4045

Closed

This was referenced May 4, 2021

[R-package] Use R standard routines to access character data in C++ #4252

Merged

[R-package] manage Dataset and Booster handles as R external pointers (fixes #3016) #4265

Merged

jameslamb mentioned this issue May 20, 2021

release 3.3.0 #4310

Closed

21 tasks

This was referenced Aug 22, 2021

[docs] Clarify the fact that predict() on a file does not support saved Datasets (fixes #4034) #4545

Merged

Enable use of constructed Dataset in predict() methods #4546

Closed

jameslamb closed this as completed in #4545 Aug 25, 2021

jameslamb mentioned this issue Jan 2, 2022

python api can't continue train with binary file data #4311

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
