
Feat/optimize single prediction #2992

Merged

Conversation

AlbertoEAF
Copy link
Contributor

@AlbertoEAF AlbertoEAF commented Apr 11, 2020

This follows up on the findings in #2935.

A new struct is added to the C API, along with some methods that leverage it for faster single-row predictions.
There is no impact on the current API, as all changes are new, independent additions:

  • The method of interest: LGBM_BoosterPredictForMatSingleRowFast()
  • Auxiliary methods:
    • LGBM_BoosterPredictForMatSingleRowFastInit() - instantiates the FastConfig
    • LGBM_FastConfigFree() - releases the FastConfig

This method splits set-up from scoring. By reducing memory allocations, it improves not only throughput but latency as well.

It does, however, require two more calls from the user: one to set up the FastConfig and another to release it.

To score, one simply passes the FastConfig instead of the array data again, as this also "caches" the get_row_fun. To score new instances, one just overwrites the data in the single-row features array and makes another Fast predict call.
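For illustration, here is a minimal sketch of the call sequence. The Fast signature matches this PR's current docs; the Init argument list is abbreviated and illustrative (it changed during review), so take c_api.h as the authoritative reference:

    #include <LightGBM/c_api.h>

    // Minimal sketch. The Init argument list below is illustrative, not the
    // exact signature; see c_api.h in this PR for the authoritative version.
    void score_stream(BoosterHandle booster, double* row, int32_t ncol, int n_events) {
      FastConfigHandle fast_config;
      // One-time setup: registers the single-row buffer and its data layout.
      LGBM_BoosterPredictForMatSingleRowFastInit(booster, row, C_API_DTYPE_FLOAT64,
                                                 ncol, 1 /* is_row_major */, "", &fast_config);
      int64_t out_len = 0;
      double out_result = 0.0;
      for (int i = 0; i < n_events; ++i) {
        // Overwrite `row` in place with the next event's features, then score:
        LGBM_BoosterPredictForMatSingleRowFast(fast_config, C_API_PREDICT_NORMAL,
                                               -1 /* num_iteration: all */, &out_len, &out_result);
      }
      LGBM_FastConfigFree(fast_config);  // releases the FastConfig
    }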

I also have a commit for profiling this, which I didn't merge as it would add more files to the repo, but it shows how one can score:
AlbertoEAF@d965ba1

I wonder, though, if we should start adding C tests as well. The ones in that branch could be explored for that purpose.

Initially there was just a static FastConfig instantiated inside c_api.cpp, which the user of the C API didn't even know about, but I decided to make it public so that many instances can be managed. I think this is ready to start the review process :)

@jameslamb
Copy link
Collaborator

Hi @AlbertoEAF, sorry for the inconvenience, but could you please merge in the most recent master? We just merged some fixes for the continuous integration failures we've had recently. That should make your build green.

@AlbertoEAF AlbertoEAF force-pushed the feat/optimize-single-prediction branch from ce384e9 to 2ab5707 on April 18, 2020 22:34
@AlbertoEAF
Copy link
Contributor Author

Hello @jameslamb, no problem, but it seems there are still some issues, namely linting errors in code I didn't touch. I'm assuming it's best not to touch those, as whoever added that code is likely to fix them in master. I'm not sure when that will happen so that I can rebase, though.

@StrikerRUS
Copy link
Collaborator

@AlbertoEAF Hi!

Those linting errors were introduced in this PR:

src/c_api.cpp:1676:  "private:" should be preceded by a blank line  [whitespace/blank_line] [3]
src/c_api.cpp:1699:  At least two spaces is best between code and comments  [whitespace/comments] [2]
src/c_api.cpp:1701:  Closing ) should be moved to the previous line  [whitespace/parens] [2]
Done processing src/c_api.cpp

Please fix them.

Also, there are some problems (a forgotten argument?) with the Doxygen documentation for the new functions that should be resolved before starting the review process for this PR.

/home/travis/build/microsoft/LightGBM/include/LightGBM/c_api.h:915: error: argument 'fast_config_handle' of command @param is not found in the argument list of LGBM_BoosterPredictForMatSingleRowFast(FastConfigHandle fastConfig_handle, int predict_type, int num_iteration, int64_t *out_len, double *out_result) (warning treated as error, aborting now)

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Apr 23, 2020

Thanks for the pointer @StrikerRUS! I misread the linter output as if the changes were in another file, my bad.


@imatiach-msft: I want to know if reusing the input features array is a limitation for your use case.

One could choose to pass the pointer to the new input features address at every prediction, but reusing the same data address (which implies you must overwrite it prior to a new prediction) still brings a tiny extra speed-up, and helps latencies by avoiding more allocations (lambda capture followed by get_row_fun's std::function creation).

@AlbertoEAF
Copy link
Contributor Author

@imatiach-msft ping :D

@imatiach-msft
Copy link
Contributor

@AlbertoEAF yes, this is very nice. I don't think re-using the same native data array is a limitation, but I think there is some misunderstanding, as mmlspark uses the method LGBM_BoosterPredictForCSRSingle, which @eisber added, for sparse prediction:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala#L247
and LGBM_BoosterPredictForMatSingle for dense prediction:
https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMBooster.scala#L266
I believe these take the native representation from the Java array directly, so there shouldn't be any memory churn, unless I am misunderstanding something, e.g. see:
https://github.com/microsoft/LightGBM/blob/master/swig/lightgbmlib.i#L119

    int* indices0 = (int*)jenv->GetPrimitiveArrayCritical(indices, &isCopy);
    double* values0 = (double*)jenv->GetPrimitiveArrayCritical(values, &isCopy);

@imatiach-msft
Copy link
Contributor

Also, it looks like the current code only addresses the dense case, but we also use the CSR predict function for the sparse case.

@imatiach-msft
Copy link
Contributor

"helps latencies by avoiding more allocations (lambda capture followed by get_row_fun's std::function creation)."

Sorry, could you explain this a bit more: what is the lambda capture that you are referring to, specifically? Would this be an issue for mmlspark, or is this only for native-only callers who are not using mmlspark? For mmlspark, there shouldn't be more memory allocated for the input data, just memory passed directly from Java data structures that were already allocated.

The get_row_fun's std::function creation seems like it could be separate from re-using the same memory for the input data, e.g. that part could be optimized even if a new array was passed in every time? Maybe I am missing something there.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented May 2, 2020

Hmm interesting @imatiach-msft,

In the Java implementation I'm working on, a C array is initialized at program startup (pointer swigResources.swigInstancePtr), and for every new runtime instance, I get new features from an external source in a custom object Instance, and then copy those values into the C array. I was using:

private void initSWIGInstanceArray(final Instance instance) {

    int skipTargetOffset = 0; // set to -1 after passing the target (if it exists)
    for (int i = 0; i < this.schemaNumFields; ++i) {
        // If the label is not present, schemaTargetIndex = -1, so the "if" won't trigger:
        if (i == this.schemaTargetIndex) {
            skipTargetOffset = -1;
        } else {
            lightgbmlibJNI.doubleArray_setitem(
                    this.swigResources.swigInstancePtr,
                    i + skipTargetOffset,
                    instance.getValue(i)
            );
        }
    }
}

and then use that pointer for the prediction call. Thus, my address is always the same.

To use that function I'd need first to copy to a Java double array and then use that array data (which is then passed to C by GetPrimitiveArrayCritical). Maybe the overhead of copying the data to a Java double array and then using that function relying on GetPrimitiveArrayCritical is smaller when there are many features (given that the number of JNI calls I have to do is proportional to the number of features)? If it requires a copy, though, I'm not sure it won't be slower (2 copies total).

With the GetPrimitiveArrayCritical version there are no guarantees that we won't get a copy (and thus a new address), so I guess the address of the feature array data cannot be set only once and reused, right @imatiach-msft?
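For reference, the JNI pattern in question looks roughly like this (a sketch; the pinned address is only valid until the matching Release call and may point at a fresh copy each time):

    #include <jni.h>

    // Sketch: pin a Java double[] around a single native call. `isCopy` reports
    // whether the JVM pinned the real array or handed back a copy, which is why
    // the address cannot be cached across predictions.
    void withPinnedRow(JNIEnv* jenv, jdoubleArray values) {
      jboolean isCopy = JNI_FALSE;
      double* values0 = static_cast<double*>(jenv->GetPrimitiveArrayCritical(values, &isCopy));
      // ... call the prediction function with values0 ...
      jenv->ReleasePrimitiveArrayCritical(values, values0, JNI_ABORT);  // read-only: discard changes
    }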

When I spoke of the lambda capture & std::function I meant both steps on the C++ side. Clearly the biggest overhead there comes from creating the new std::function to accommodate the new features address.

The get_row_fun's std::function creation seems like it could be separate from re-using the same memory for the input data, eg that part could be optimized even if a new array was passed in every time?

How do you suggest doing that @imatiach-msft ?
I'm not seeing how to replace this:

auto get_row_fun = RowPairFunctionFromDenseMatric(data, 1, ncol, data_type, is_row_major);

Right now a std::function is generated and data is not one of its parameters, we'd have to change all the code that relies on it so we could pass an extra pointer data at the instant of getting the values? And thus also keep track of that data pointer whenever we need to use that function? That doesn't seem very practical to me; it would likely be best not to assume that the address is the same and drop this optimization. Unless you have a better suggestion, to which I'm all ears! :)
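To make that capture concrete, here is a simplified sketch of what such a factory does (not the actual RowPairFunctionFromDenseMatric implementation):

    #include <functional>
    #include <vector>

    // Simplified sketch: the returned std::function captures `data` by value, so
    // a new function object must be built whenever the data address changes.
    std::function<std::vector<double>(int)> RowFunctionSketch(const double* data, int num_col) {
      return [data, num_col](int row_idx) {
        const double* row = data + static_cast<size_t>(row_idx) * num_col;
        return std::vector<double>(row, row + num_col);
      };
    }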

Thanks!

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented May 9, 2020

Gentle ping to @imatiach-msft :)

Pinging (RFC) @guolinke @StrikerRUS as well, as maybe you also have input regarding eliminating the generation of a new std::function by RowPairFunctionFromDenseMatric on every prediction with a new data address :)

Thanks guys, I appreciate any input here :)

@StrikerRUS StrikerRUS mentioned this pull request May 11, 2020
@AlbertoEAF
Copy link
Contributor Author

@imatiach-msft I decided to remove the optimization of reusing the function.

We lose 5% of the 30% speed improvement but now it is usable for my case (generic) and MMLSpark's (with the GetCritical calls).

How do you want to deal with the Java wrappers?

Keep the current version in lightgbmlib.i, as it requires no separate setup and teardown steps, AND add the new one? Or replace it with the scoring call of the new one? (Setup and teardown can use the C API directly; there's no need for wrapping there.)

Regarding the CSR optimizations I think we can leave that for later? You already use that only for specific cases, the general one is the DenseRow right?
(Adding it should bring no more complexity; I believe the same FastConfig can be used, it's just rinse and repeat.)

@AlbertoEAF AlbertoEAF force-pushed the feat/optimize-single-prediction branch from 6cd9ee4 to a9eae2b on May 15, 2020 09:32
@imatiach-msft
Copy link
Contributor

@AlbertoEAF you have a lot of questions :)
"We lose 5% of the 30% speed improvement but now it is usable for my case (generic) and MMLSpark's (with the GetCritical calls)."
Is there a way we could enable both cases? A 5% improvement would still be nice to have.

"To use that function I'd need first to copy to a Java double array and then use that array data"
No, I think the way you are doing it right now is much better; don't copy to a separate Java array just for this.

Do you actually have to predict one row at a time in your program? You could probably improve performance much more by batching, if that is possible in your scenario: putting multiple instances in a single native array and then calling predict once.

"GetPrimitiveArrayCritical version there are no guarantees that we won't get a copy"
That seems like another good reason to do it the way you are doing it now and not create a separate Java array just for this; I don't think that would make sense if you are really trying to optimize performance. That method is really optimized just for Spark.

"Right now a std::function is generated and data is not one of its parameters, we'd have to change all the code that relies on it so we could pass an extra pointer data at the instant of getting the values? "
At first I thought yes; then, given your hesitation, I thought no; but looking at the code more, it seems like this might work. You can either pass data to the function, or pass in a double pointer to the data and modify which row it points to between calls, which may make the code more confusing but wouldn't require as much refactoring:

Option 1:
std::function<std::vector<double>(int row_idx, const void* data)>
RowFunctionFromDenseMatric(int num_row, int num_col, int data_type, int is_row_major) {
Option 2:
std::function<std::vector<double>(int row_idx)>
RowFunctionFromDenseMatric(void** data_ptr, int num_row, int num_col, int data_type, int is_row_major) {
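For example, a sketch of Option 2 (illustrative only, with a concrete double type): the closure captures a pointer-to-pointer, so the caller can retarget the data between calls without constructing a new std::function:

    #include <functional>
    #include <vector>

    // Illustrative sketch of Option 2: capturing a pointer-to-pointer lets the
    // caller swap the underlying data address between calls while reusing the
    // same std::function object.
    std::function<std::vector<double>(int)> MakeRowFunction(const double** data_ptr, int num_col) {
      return [data_ptr, num_col](int row_idx) {
        const double* row = *data_ptr + static_cast<size_t>(row_idx) * num_col;
        return std::vector<double>(row, row + num_col);
      };
    }
    // Usage: repoint *data_ptr at the new features between predictions;
    // the function object itself is never rebuilt.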

Agree that it might not be worth it if it involves a lot of refactoring. If that actually is hurting performance, it may make sense to refactor it. I would check with @guolinke first though.

Actually, looking through the code more, I see some optimizations that could be made. We take in an array, copy it to a vector of index->value pairs, and then in the predict function we copy it again to a buffer by accessing indices in the dense case, but it seems like we could bypass this and just copy the original data array:

CopyToPredictBuffer(predict_buf_[tid].data(), features);

See LGBM_BoosterPredictForMatSingleRow:

auto get_row_fun = RowPairFunctionFromDenseMatric(data, 1, ncol, data_type, is_row_major);

See the code that calls PredictSingleRow and predict_function:
single_row_predictor_[predict_type]->predict_function(one_row, pred_wrt_ptr);

See code which creates the Predict function:
https://github.com/microsoft/LightGBM/blob/master/src/c_api.cpp#L418
See predict_function, which is a call to GetPredictFunction:
predict_function = predictor_->GetPredictFunction();

This optimization might take too much time to implement, though, and I'm not sure it's worth making those complex changes in the code. Conceptually, however, there is room for some optimization, because the intermediate vector representation isn't really useful in the dense case: it just increases memory churn and the time to copy data over, even though in the end it isn't even the data structure used for prediction.
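Conceptually, the dense shortcut would look something like this (a sketch, not the actual LightGBM code):

    #include <algorithm>
    #include <vector>

    // Sketch: in the dense case the input row could be copied straight into the
    // predict buffer, skipping the intermediate (index, value) pair vector.
    void FillPredictBufferDense(const double* row, int num_col, std::vector<double>* predict_buf) {
      predict_buf->resize(num_col);
      std::copy(row, row + num_col, predict_buf->begin());
    }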

"Regarding the CSC optimizations I think we can leave that for later? You already use that only for specific cases, the general one is the DenseRow right?"
Sparse scenario is super important, especially with the work I am doing right now with (Microsoft's AzureML) AutoML and their model interpretability feature, where if their data has strings it is almost always featurized into a sparse representation, which is why I'm trying to add a sparse TreeShap implementation here: #3000. That's not related to MMLSpark's LightGBM though, where sparse structures aren't used as often, although sparse scenario is very important in many use-cases - I don't know the exact ratio, I would guess maybe 70 dense/30 sparse percent split?

@AlbertoEAF
Copy link
Contributor Author

especially with the work I am doing right now with (Microsoft's AzureML) AutoML and their model interpretability feature,

Cool stuff!

I hope to have time this week to get back to you on all that you said :p

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented May 23, 2020

Regarding your last comment @imatiach-msft, I agree that optimizing that would be nice, but I'd rather avoid it, as passing the original data pointer around further complicates the calls. @guolinke, what are your thoughts on that?

Do you actually have to predict one row at a time from your program?

Yes, I do :) It's meant for real-time use with as little latency as possible, so I cannot wait for more events.

Sparse scenario is super important
I'll try to add the corresponding sparse methods to SWIG as well, then.

The sparse predictors like LGBM_BoosterPredictForCSR take num_col as int64_t, but LGBM_BoosterPredictForMatSingleRow takes int32_t.

As lots of code depends on that, I didn't change int64_t to int32_t. Instead, I added an IntUnion to FastConfig so that either type can be used, which meant not changing any existing functions in the C API. (In LGBM_BoosterPredictForCSRSingleRow there was already a cast of num_col from int64_t to int32_t, further reinforcing the point above.)
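Roughly, the idea is the following (a sketch; the actual struct in c_api.cpp may differ in naming and members):

    #include <cstdint>

    // Sketch only: storing ncol in a union lets the dense (int32_t) and
    // CSR (int64_t) call paths share the same FastConfig without touching any
    // existing C API function signatures.
    union IntUnion {
      int32_t i32;
      int64_t i64;
    };

    struct FastConfigSketch {
      int predict_type;
      int data_type;
      IntUnion ncol;  // dense callers use i32, sparse (CSR) callers use i64
    };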


REVIEW REQUEST

I think the code is done, can you review? @imatiach-msft, @guolinke, @StrikerRUS

I also added the SWIG helpers for the new dense and sparse variants, which also rely on the Critical JNI call for maximum speed, @imatiach-msft ;)

I tested by introducing 2 profiling C++ "tests" with which you can compare the timings and scores of the LGBM_BoosterPredictForMatSingleRowFast and LGBM_BoosterPredictForMatSingleRow function calls.

Each .cpp scores the same instance a bunch of times and prints the last call's score. You can see that the scores are the same, and with the Higgs model (28 features) the Fast variant is 19-20% faster, even when the config has no parameters in it.

I added a profiling folder (which should probably move to tests?). It builds a binary that scores the same event lots of times using the Fast and non-Fast variants and prints the last score (all will be the same). I get the same score with the Fast variant as with the regular SingleRow method:

[screenshot: both variants print the same score]

I didn't upload the model as it's still 14 MB (though uploading it through git lfs would be OK; it wouldn't add to the git repo's weight).

You can repeat the benchmarks yourselves by following the instructions in the last commit message.

Tell me what you think :)

@AlbertoEAF
Copy link
Contributor Author

Anything left for the merge, @guolinke?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

@AlbertoEAF Thanks a lot for pushing this PR forward!

Please consider fixing the typos listed below. Also, if it is not too hard for you, please replace all single ` with double `` for consistency across the codebase.

(3 resolved review comments on include/LightGBM/c_api.h)
@jameslamb jameslamb removed their request for review July 13, 2020 15:38
AlbertoEAF and others added 4 commits July 14, 2020 16:29
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
@AlbertoEAF
Copy link
Contributor Author

Changed @StrikerRUS :)

Anything left for merging?

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

@AlbertoEAF Thanks a lot! Awesome as always! 🙂

@StrikerRUS StrikerRUS merged commit fc79b36 into microsoft:master Jul 15, 2020
@AlbertoEAF AlbertoEAF deleted the feat/optimize-single-prediction branch July 16, 2020 11:00
AlbertoEAF added a commit to AlbertoEAF/LightGBM that referenced this pull request Aug 1, 2020
guolinke pushed a commit that referenced this pull request Aug 5, 2020
* Fix bug introduced in PR #2992 for Fast predict

* Faster Fast predict API

* Add const to SingleRow Fast methods
@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023