Feat/optimize single prediction #2992
Conversation
Hi @AlbertoEAF, sorry for the inconvenience, but can you please merge the most recent master?
Force-pushed ce384e9 to 2ab5707
Hello @jameslamb, no problem, but it seems there are still some issues, namely linting errors in code I didn't touch. I'm assuming it's best not to touch those, as whoever added that code is likely to fix them in master. I'm not sure when that will be fixed, though, so I don't know when to rebase.
@AlbertoEAF Hi! Those linting errors are introduced in this PR, so please fix them. Also, there are some problems (a forgotten argument?) with the Doxygen documentation for the new functions that should be resolved before starting the review process for this PR.
Thanks for the pointer @StrikerRUS! I misread the linter output as if the changes were in another file, my bad. @imatiach-msft: I want to know whether reusing the input features array is a limitation for your use case. One could pass the pointer to a new input features address at every prediction, but reusing the same data address (which implies you must overwrite it prior to a new prediction) brings a tiny extra speed-up, and helps latency by avoiding more allocations (a lambda capture followed by `get_row_fun`'s `std::function` creation).
@imatiach-msft ping :D
@AlbertoEAF yes, this is very nice. I don't think re-using the same native data array is a limitation, but I think there is some misunderstanding: mmlspark uses the method `LGBM_BoosterPredictForCSRSingle` for sparse prediction, which @eisber added.
Also, it looks like the current code only addresses the dense case, but we also use the CSR predict function for the sparse case.
"helps latencies by avoiding more allocations (lambda capture followed by get_row_fun's std::function creation)"

Sorry, could you explain this a bit more? What is the lambda capture that you are referring to specifically? Would this be an issue for mmlspark, or only for native callers who are not using mmlspark? For mmlspark, there shouldn't be more memory allocated for the input data, just memory passed directly from Java data structures that were already allocated. The creation of `get_row_fun`'s `std::function` seems like it could be separate from re-using the same memory for the input data, e.g. that part could be optimized even if a new array were passed in every time? Maybe I am missing something there.
Hmm, interesting @imatiach-msft. In the Java implementation I'm working on, a C array is initialized at program startup, and I fill it before each prediction:

```java
private void initSWIGInstanceArray(final Instance instance) {
    int skipTargetOffset = 0; // set to -1 after passing the target (if it exists)
    for (int i = 0; i < this.schemaNumFields; ++i) {
        // If the label is not present, targetIndex=-1, thus the "if" won't trigger:
        if (i == this.schemaTargetIndex) {
            skipTargetOffset = -1;
        } else {
            lightgbmlibJNI.doubleArray_setitem(
                this.swigResources.swigInstancePtr,
                i + skipTargetOffset,
                instance.getValue(i)
            );
        }
    }
}
```

I then use that pointer for the prediction call, thus my address is always the same. To use that function I'd need first to copy to a Java double array and then use that array data (which is then passed to C). With the `GetPrimitiveArrayCritical` version there are no guarantees that we won't get a copy.

When I spoke of the lambda capture & `std::function` I meant both steps on the C++ side. Clearly the biggest overhead there comes from creating the new `std::function` to accommodate the new features address.
How do you suggest doing that @imatiach-msft? See `LGBM_BoosterPredictForMatSingleRow` (line 1640 in 7076cb8). Right now a `std::function` is generated and `data` is not one of its parameters; we'd have to change all the code that relies on it so we could pass an extra pointer `data` at the instant of getting the values. Thanks!
Gentle ping to @imatiach-msft :) Pinging as well (RFC) @guolinke @StrikerRUS, as maybe you also have input regarding eliminating the generation of a new `std::function`. Thanks guys, I appreciate any input here :)
@imatiach-msft I decided to remove the optimization of reusing the function. We lose 5% of the 30% speed improvement, but now it is usable both for my case (generic) and MMLSpark's (with the GetCritical calls). How do you want to deal with the Java wrappers? Keep the current version in `lightgbmlib.i`, as it requires no separate setup and teardown steps, AND the new one? Or replace it with the scoring call of the new one? (Setup and teardown can use the C API directly, no need for wrapping there.) Regarding the CSC optimizations, I think we can leave those for later? You already use that only for specific cases; the general one is the DenseRow, right?
Force-pushed 6cd9ee4 to a9eae2b
@AlbertoEAF you have a lot of questions :)

"To use that function I'd need first to copy to a Java double array and then use that array data"

Do you actually have to predict one row at a time from your program? You could probably improve performance much more by batching, if that is possible in your scenario: put multiple instances in a single native array and then call predict once.

"GetPrimitiveArrayCritical version there are no guarantees that we won't get a copy"

"Right now a std::function is generated and data is not one of its parameters, we'd have to change all the code that relies on it so we could pass an extra pointer data at the instant of getting the values?"

Agree that it might not be worth it if it involves a lot of refactoring. If that actually is hurting performance, it may make sense to refactor it; I would check with @guolinke first, though. Actually, looking through the code more, I see some optimizations that could be made. We take in an array, copy it to a vector of index->value pairs, and then in the predict function we copy it again into a buffer by accessing indexes in the dense case, but it seems like we could just bypass this and copy the original data array directly:

- the copy into the buffer: `src/application/predictor.hpp`, line 107 in 0aa7bfe
- `LGBM_BoosterPredictForMatSingleRow`: line 1640 in 7076cb8
- the code that calls `PredictSingleRow` and `predict_function`: line 380 in 7076cb8
- the code which creates the Predict function: https://github.com/microsoft/LightGBM/blob/master/src/c_api.cpp#L418
- `predict_function`, which is a call to `GetPredictFunction`: line 77 in 7076cb8

This optimization might take too much time to implement, though, and I'm not sure how much it's worth it to make those complex changes in the code. Conceptually, though, there is room for some optimization, because the intermediate vector representation isn't really useful in the dense case; it just increases memory churn and time to copy over data, even though in the end it's not even the data structure that is used for prediction.

"Regarding the CSC optimizations I think we can leave that for later? You already use that only for specific cases, the general one is the DenseRow right?"
Cool stuff! I hope to have time this week to get back to you on all that you said :p
Force-pushed a9eae2b to d24eabe

Force-pushed 349e6be to f299506
Regarding your last comment @imatiach-msft, I agree that optimizing that would be nice, but I'd rather avoid it, as passing the original data pointer around further complicates the calls. @guolinke, what are your thoughts on that?
Yes, I do :) It's meant for real-time use with as little latency as possible, so I cannot wait for more events to batch them.
The sparse predictors use `int64_t` in their signatures. As lots of code depends on that, I didn't change `int64_t` to `int32_t`; instead I added an `IntUnion` to `FastConfig` so any type can be used, which meant not changing any existing functions in the C API. In `LGBM_BoosterPredictForCSRSingleRow` there was already a cast of `num_col` from `int64_t` to `int32_t`, further reinforcing the point above.

REVIEW REQUEST: I think the code is done, can you review? @imatiach-msft, @guolinke, @StrikerRUS

I also added the helpers in SWIG for the new dense and sparse variants, which rely on the Critical JNI call as well for maximum speed @imatiach-msft ;)

I tested by introducing 2 profiling C++ "tests" with which you can compare the times of the Fast and non-Fast variants. Each .cpp scores the same instance a bunch of times and prints the last call's score. You can see the scores are the same, and with the Higgs model (28 features) the Fast variant is 19-20% faster even when the config has no parameters in it.

I added a profiling folder (probably should move it to tests?). It builds a binary that scores the same event lots of times using the fast and non-fast variants and prints the last score (all will be the same). I get the same score with the Fast variant and with the regular SingleRow method. I didn't upload the model, as it's still 14MB (though uploading through git lfs would be OK; it wouldn't add to the git repo weight). You can repeat the benchmarks yourselves by following the instructions in the last commit message.

Tell me what you think :)
Anything left for the merge, @guolinke?
@AlbertoEAF Thanks a lot for pushing this PR forward! Please consider fixing some typos listed below. Also, if it is not very hard for you, please replace all single backticks (`) with double ones (``) for consistency across the codebase.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Changed, @StrikerRUS :) Anything left for merging?
@AlbertoEAF Thanks a lot! Awesome as always! 🙂
* Fix bug introduced in PR #2992 for Fast predict
* Faster Fast predict API
* Add const to SingleRow Fast methods
This follows from the findings in #2935.
A new struct is added to the C API, along with some methods to leverage it for faster single-row predictions.
There are no impacts to the current API as all changes are new independent additions:
- `LGBM_BoosterPredictForMatSingleRowFast()`
- `LGBM_BoosterPredictForMatSingleRowFastInit()` - instantiates the `FastConfig`
- `LGBM_FastConfigFree()` - releases the `FastConfig`
This method splits set-up from scoring. As it reduces memory allocations, it improves not only throughput but latency as well.
It, however, requires 2 more calls from whoever is using it: one for the setup of the `FastConfig`, and another for its release. To score, one simply passes the `FastConfig` and not the array data again, as this also "caches" the `get_row_fun`. Basically, to score new instances one just replaces the data in the single-row features array and makes a new FastPredict call.

I also have a commit for profiling this, which I didn't merge as it would add more files to the repo, but it shows how one can score:
AlbertoEAF@d965ba1
I don't know, however, if we shouldn't start adding C tests as well. The ones on that branch could be explored for that purpose.
Initially there was just a static `FastConfig` instantiated inside `c_api.cpp` which the user of the C API didn't even know about, but I decided to change it to be public, so that many instances can be managed. I think this is ready to start the review process :)