Feat/optimize single prediction #2992
Conversation
Hi @AlbertoEAF, sorry for the inconvenience, but can you please merge the most recent master?
Force-pushed ce384e9 to 2ab5707
Hello @jameslamb, no problem, but it seems there are still some issues, namely linting errors in code I didn't touch. I'm assuming it's best not to touch those, as whoever added that code is likely to fix them in master. I'm not sure when that will be fixed, though, so I don't know when to rebase.
@AlbertoEAF Hi! Those linting errors are introduced in this PR, so please fix them. Also, there are some problems (a forgotten argument?) with the Doxygen documentation for the new functions that should be resolved before starting the review process for this PR.
Thanks for the pointer @StrikerRUS! I misread the linter output as if the changes were in another file, my bad. @imatiach-msft: I want to know whether reusing the input features array is a limitation for your use case. One could pass the pointer to a new input features address at every prediction, but reusing the same data address (which implies you must overwrite it prior to a new prediction) brings a tiny extra speed-up, and helps latency by avoiding more allocations (a lambda capture followed by `get_row_fun`'s `std::function` creation).
@imatiach-msft ping :D
@AlbertoEAF yes, this is very nice. I don't think re-using the same native data array is a limitation, but I think there is some misunderstanding: mmlspark uses the method `LGBM_BoosterPredictForCSRSingle` for sparse prediction, which @eisber added.
Also, it looks like the current code only addresses the dense case, but we also use the CSR predict function for the sparse case.
"helps latencies by avoiding more allocations (lambda capture followed by get_row_fun's std::function creation)"

Sorry, could you explain this a bit more? What is the lambda capture that you are referring to specifically? Would this be an issue for mmlspark, or only for native callers who are not using mmlspark? For mmlspark, there shouldn't be more memory allocated for the input data, just memory passed directly from Java data structures that were already allocated. The creation of `get_row_fun`'s `std::function` seems like it could be separate from re-using the same memory for the input data, e.g. that part could be optimized even if a new array were passed in every time? Maybe I am missing something there.
Hmm, interesting @imatiach-msft. In the Java implementation I'm working on, a C array is initialized at program startup, and I fill it before each prediction:

```java
private void initSWIGInstanceArray(final Instance instance) {
    int skipTargetOffset = 0; // set to -1 after passing the target (if it exists)
    for (int i = 0; i < this.schemaNumFields; ++i) {
        // If the label is not present, targetIndex=-1, thus the "if" won't trigger:
        if (i == this.schemaTargetIndex) {
            skipTargetOffset = -1;
        } else {
            lightgbmlibJNI.doubleArray_setitem(
                this.swigResources.swigInstancePtr,
                i + skipTargetOffset,
                instance.getValue(i)
            );
        }
    }
}
```

I then use that pointer for the prediction call, thus my address is always the same. To use that function I'd need first to copy to a Java double array and then use that array data (which is then passed to C). With the `GetPrimitiveArrayCritical` version there are no guarantees that we won't get a copy.

When I spoke of the lambda capture & `std::function` I meant both steps on the C++ side. Clearly the biggest overhead there comes from creating the new `std::function` to accommodate the new features address.
How do you suggest doing that @imatiach-msft? See `LGBM_BoosterPredictForMatSingleRow` (line 1640 in 7076cb8). Right now a `std::function` is generated and `data` is not one of its parameters; we'd have to change all the code that relies on it so we could pass an extra pointer `data` at the instant of getting the values. Thanks!
Gentle ping to @imatiach-msft :) Pinging as well (RFC) @guolinke @StrikerRUS, as maybe you also have input regarding eliminating the generation of a new `std::function`. Thanks guys, I appreciate any input here :)
@imatiach-msft I decided to remove the optimization of reusing the function. We lose 5% of the 30% speed improvement, but now it is usable both for my case (generic) and MMLSpark's (with the GetCritical calls). How do you want to deal with the Java wrappers? Keep the current version in `lightgbmlib.i`, as it requires no separate setup and teardown steps, AND the new one? Or replace it with the scoring call of the new one? (Setup and teardown can use the C API directly, no need for wrapping there.) Regarding the CSC optimizations, I think we can leave those for later? You already use that only for specific cases; the general one is the DenseRow, right?
Force-pushed 6cd9ee4 to a9eae2b
@AlbertoEAF you have a lot of questions :)

"To use that function I'd need first to copy to a Java double array and then use that array data"

Do you actually have to predict one row at a time from your program? You could probably improve performance much more by batching, if that is possible in your scenario: put multiple instances in a single native array and then call predict once.

"GetPrimitiveArrayCritical version there are no guarantees that we won't get a copy"

"Right now a std::function is generated and data is not one of its parameters, we'd have to change all the code that relies on it so we could pass an extra pointer data at the instant of getting the values?"

Agree that it might not be worth it if it involves a lot of refactoring. If that actually is hurting performance, it may make sense to refactor it; I would check with @guolinke first, though. Actually, looking through the code more, I see some optimizations that could be made. We take in an array, copy it to a vector of index->value pairs, and then in the predict function we copy it again into a buffer by accessing indexes in the dense case, but it seems like we could just bypass this and copy the original data array directly:

- the copy into the buffer: `src/application/predictor.hpp`, line 107 in 0aa7bfe
- `LGBM_BoosterPredictForMatSingleRow`: line 1640 in 7076cb8
- the code that calls `PredictSingleRow` and `predict_function`: line 380 in 7076cb8
- the code which creates the Predict function: https://github.com/microsoft/LightGBM/blob/master/src/c_api.cpp#L418
- `predict_function`, which is a call to `GetPredictFunction`: line 77 in 7076cb8

This optimization might take too much time to implement, though, and I'm not sure how much it's worth it to make those complex changes in the code. Conceptually, though, there is room for some optimization, because the intermediate vector representation isn't really useful in the dense case; it just increases memory churn and time to copy over data, even though in the end it's not even the data structure that is used for prediction.

"Regarding the CSC optimizations I think we can leave that for later? You already use that only for specific cases, the general one is the DenseRow right?"
Cool stuff! I hope to have time this week to get back to you on all that you said :p
Force-pushed a9eae2b to d24eabe

Force-pushed 349e6be to f299506
Regarding your last comment @imatiach-msft, I agree that optimizing that would be nice, but I'd rather avoid it, as passing the original data pointer around further complicates the calls. @guolinke, what are your thoughts on that?
Yes, I do :) It's meant for real-time use with as little latency as possible, so I cannot wait for more events to batch them.
The sparse predictors use `int64_t` in their signatures. As lots of code depends on that, I didn't change `int64_t` to `int32_t`; instead I added an `IntUnion` to `FastConfig` so any type can be used, which meant not changing any existing functions in the C API. In `LGBM_BoosterPredictForCSRSingleRow` there was already a cast of `num_col` from `int64_t` to `int32_t`, further reinforcing the point above.

REVIEW REQUEST: I think the code is done, can you review? @imatiach-msft, @guolinke, @StrikerRUS

I also added the helpers in SWIG for the new dense and sparse variants, which rely on the Critical JNI call as well for maximum speed @imatiach-msft ;)

I tested by introducing 2 profiling C++ "tests" with which you can compare the times of the Fast and non-Fast variants. Each .cpp scores the same instance a bunch of times and prints the last call's score. You can see the scores are the same, and with the Higgs model (28 features) the Fast variant is 19-20% faster even when the config has no parameters in it.

I added a profiling folder (probably should move it to tests?). It builds a binary that scores the same event lots of times using the fast and non-fast variants and prints the last score (all will be the same). I get the same score with the Fast variant and with the regular SingleRow method. I didn't upload the model, as it's still 14MB (though uploading through git lfs would be OK; it wouldn't add to the git repo weight). You can repeat the benchmarks yourselves by following the instructions in the last commit message.

Tell me what you think :)
Anything left for the merge, @guolinke?
@AlbertoEAF Thanks a lot for pushing this PR forward! Please consider fixing some typos listed below. Also, if it is not very hard for you, please replace all single backticks (`) with double ones (``) for consistency across the codebase.
Co-authored-by: Nikita Titov <nekit94-08@mail.ru>
Changed, @StrikerRUS :) Anything left for merging?
@AlbertoEAF Thanks a lot! Awesome as always! 🙂
* Fix bug introduced in PR #2992 for Fast predict
* Faster Fast predict API
* Add const to SingleRow Fast methods
This follows from the findings in #2935.
A new struct is added to the C API, along with some methods to leverage it for faster single-row predictions.
There are no impacts to the current API as all changes are new independent additions:
- `LGBM_BoosterPredictForMatSingleRowFast()`
- `LGBM_BoosterPredictForMatSingleRowFastInit()` - instantiates the `FastConfig`
- `LGBM_FastConfigFree()` - releases the `FastConfig`
This method splits set-up from scoring. As it reduces memory allocations, it improves not only throughput but latency as well.
It, however, requires 2 more calls from whoever is using it: one for the setup of the `FastConfig`, and another for its release. To score, one simply passes the `FastConfig` and not the array data again, as this also "caches" the `get_row_fun`. Basically, to score new instances one just replaces the data in the single-row features array and makes a new FastPredict call.

I also have a commit for profiling this, which I didn't merge as it would add more files to the repo, but it shows how one can score:
AlbertoEAF@d965ba1
I don't know, however, if we shouldn't start adding C tests as well. The ones on that branch could be explored for that purpose.
Initially there was just a static `FastConfig` instantiated inside `c_api.cpp` which the user of the C API didn't even know about, but I decided to change it to be public, so that many instances can be managed. I think this is ready to start the review process :)