Improve DMatrix creation performance in python #10407
@arieleiz The current interface uses NumPy's @trivialfis Are you aware of the performance implications of the use of |
Hi @hcho3! You are of course correct. I did not describe the issue correctly, and the attached microbenchmark has side effects that make its results incorrect.

After fixing the microbenchmark so that the data layout does not change, so we compare apples-to-apples: Analyzing (b) using a Python+native profiler, we see the improvement comes directly from

Another small difference is, as you suggest, due to the fact that JSON is used for neither the array interface nor the arguments (missing/nthread/data_split_mode).

If you are OK with the change in general, I'll update the commit message and fix the microbenchmark.
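To make the JSON layer mentioned above concrete, here is a rough, standard-library-only sketch of what encoding an array-interface-style descriptor and the call arguments as JSON strings could look like. The helper names (`describe_buffer`, `encode_call`, `decode_call`) and the argument values are hypothetical and are not XGBoost's actual code; only the descriptor keys follow NumPy's array interface protocol.

```python
import json
import sys
from array import array


def describe_buffer(buf: array, rows: int, cols: int) -> dict:
    """Build an array-interface-style descriptor for a flat float buffer.

    Hypothetical helper: it mimics the keys of NumPy's array interface
    (data, shape, typestr, version) without importing NumPy.
    """
    address, n_items = buf.buffer_info()
    if n_items != rows * cols:
        raise ValueError("buffer size does not match rows * cols")
    byte_order = "<" if sys.byteorder == "little" else ">"
    return {
        "data": [address, True],  # (pointer, read-only flag)
        "shape": [rows, cols],
        "typestr": f"{byte_order}f{buf.itemsize}",
        "version": 3,
    }


def encode_call(interface: dict, **config) -> tuple[str, str]:
    """Serialize the descriptor and the call arguments to JSON strings."""
    return json.dumps(interface), json.dumps(config)


def decode_call(interface_json: str, config_json: str) -> tuple[dict, dict]:
    """Stand-in for the receiving side, which has to parse both strings."""
    return json.loads(interface_json), json.loads(config_json)


values = array("f", [0.0] * 6)
iface_json, cfg_json = encode_call(
    describe_buffer(values, 2, 3), nthread=1, data_split_mode=0
)
iface, cfg = decode_call(iface_json, cfg_json)
```

Both the encode and the parse step run on every call, regardless of how small the matrix is, which is why their cost is only visible for small inputs.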
Yes, please fix the benchmark. I will defer to @trivialfis to decide whether it's worth having a separate code path to optimize for a specific use case (small matrices).
Here are the updated numbers, comparing apples to apples. (Each test is repeated 65536//rows times so that the test durations are non-trivial.)
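The repetition scheme described here can be sketched as a small timing harness. This is only an illustration and not the contents of the actual microbenchmark file: the function names are hypothetical, and it assumes the reported figures are per-call averages and that the ratio is optimized time divided by current time.

```python
import time

TOTAL_ROWS = 65536  # each case is repeated TOTAL_ROWS // rows times


def repeats_for(rows: int) -> int:
    """Number of repetitions for a case with the given row count."""
    return max(1, TOTAL_ROWS // rows)


def time_case(build, rows: int) -> float:
    """Return the average wall-clock seconds per call of `build`."""
    n = repeats_for(rows)
    start = time.perf_counter()
    for _ in range(n):
        build()
    return (time.perf_counter() - start) / n


def ratio_percent(current: float, optimized: float) -> float:
    """Optimized time as a percentage of current time (below 100 is faster)."""
    return 100.0 * optimized / current
```

Here `build` would be a zero-argument callable that constructs the matrix under test, called once per repetition.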
The xgboost Python package serializes NumPy arrays as JSON. This has non-trivial overhead for small datasets. This patch optimizes the specific case where the NumPy array is already in C-contiguous 32-bit floating-point format and has rows*cols<=32768, loading it directly without the JSON layer.

xgboost/tests/python/microbench_numpy.py:

| Threads | Rows | Cols | Current (sec) | Optimized (sec) | Ratio |
|---|---|---|---|---|---|
| 1 | 1 | 1000 | 0.0001921 | 0.0001703 | 88.6% |
| 1 | 4 | 1000 | 0.0001689 | 0.0001437 | 85.1% |
| 1 | 16 | 1000 | 0.0002639 | 0.0002457 | 93.1% |
| 1 | 64 | 1000 | 0.0006843 | 0.0006719 | 98.2% |
| 1 | 256 | 1000 | 0.002611 | 0.002655 | 101.7% |
| 1 | 1024 | 1000 | 0.013 | 0.0126 | 97.0% |
| 1 | 4096 | 1000 | 0.06081 | 0.0593 | 97.5% |
| 1 | 16384 | 1000 | 0.2981 | 0.2974 | 99.8% |
| 2 | 1 | 1000 | 0.0001415 | 0.0001196 | 84.6% |
| 2 | 4 | 1000 | 0.0002155 | 0.0002003 | 93.0% |
| 2 | 16 | 1000 | 0.0002137 | 0.000196 | 91.7% |
| 2 | 64 | 1000 | 0.0005054 | 0.0004855 | 96.1% |
| 2 | 256 | 1000 | 0.001613 | 0.001687 | 104.6% |
| 2 | 1024 | 1000 | 0.007743 | 0.008194 | 105.8% |
| 2 | 4096 | 1000 | 0.03791 | 0.03783 | 99.8% |
| 2 | 16384 | 1000 | 0.2077 | 0.2037 | 98.1% |
| 4 | 1 | 1000 | 0.0001374 | 0.0001237 | 90.0% |
| 4 | 4 | 1000 | 0.0001985 | 0.0001621 | 81.7% |
| 4 | 16 | 1000 | 0.0002266 | 0.0001988 | 87.7% |
| 4 | 64 | 1000 | 0.0005175 | 0.0004775 | 92.3% |
| 4 | 256 | 1000 | 0.00166 | 0.001594 | 96.0% |
| 4 | 1024 | 1000 | 0.008257 | 0.008097 | 98.1% |
| 4 | 4096 | 1000 | 0.03492 | 0.0354 | 101.4% |
| 4 | 16384 | 1000 | 0.1896 | 0.1897 | 100.0% |
| 8 | 1 | 1000 | 0.0001471 | 0.0001254 | 85.3% |
| 8 | 4 | 1000 | 0.0003609 | 0.000326 | 90.4% |
| 8 | 16 | 1000 | 0.0002651 | 0.0002217 | 83.6% |
| 8 | 64 | 1000 | 0.0003504 | 0.0003064 | 87.5% |
| 8 | 256 | 1000 | 0.0008264 | 0.0008729 | 105.6% |
| 8 | 1024 | 1000 | 0.003367 | 0.003127 | 92.9% |
| 8 | 4096 | 1000 | 0.01932 | 0.01799 | 93.1% |
| 8 | 16384 | 1000 | 0.1245 | 0.1208 | 97.0% |
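The eligibility condition described in this commit message (C-contiguous, 32-bit float, rows*cols<=32768) could be checked roughly as follows. This is a sketch and not the patch's actual code: `is_fast_path_candidate` and `as_matrix` are hypothetical names, and the check goes through Python's buffer protocol so that it runs without NumPy installed. With NumPy, the equivalent test would inspect `arr.dtype` and `arr.flags.c_contiguous`.

```python
from array import array

MAX_FAST_PATH_ELEMENTS = 32768  # threshold quoted in the commit message


def is_fast_path_candidate(data) -> bool:
    """Return True if `data` is a 2-D, C-contiguous float32 buffer
    with at most MAX_FAST_PATH_ELEMENTS elements."""
    try:
        view = memoryview(data)
    except TypeError:
        return False  # object does not expose the buffer protocol
    if view.format != "f" or view.ndim != 2 or not view.c_contiguous:
        return False
    rows, cols = view.shape
    return rows * cols <= MAX_FAST_PATH_ELEMENTS


def as_matrix(values, rows: int, cols: int, typecode: str = "f") -> memoryview:
    """Hypothetical helper: wrap a flat list as a rows x cols buffer."""
    flat = memoryview(array(typecode, values))
    return flat.cast("B").cast(typecode, shape=[rows, cols])


small = as_matrix([0.0] * (32 * 1000), 32, 1000)   # 32000 elements
large = as_matrix([0.0] * (33 * 1000), 33, 1000)   # 33000 elements
doubles = as_matrix([0.0] * 6, 2, 3, typecode="d")  # 64-bit floats

print(is_fast_path_candidate(small))    # within the threshold
print(is_fast_path_candidate(large))    # exceeds the threshold
print(is_fast_path_candidate(doubles))  # wrong element type
```

Under this reading of the threshold, with 1000 columns only inputs of up to 32 rows would qualify for the direct path.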
I don't mind the diverging code path in general. Many users have asked for a more robust implementation of streaming-based prediction. The question for me is whether the code is divergent enough.

For instance, there are dedicated inference libraries with special optimizations for faster inference with tree models in general, like the FIL project in cuML. They work on both random forests and boosted trees, and potentially many other types of tree-based models. Does XGBoost want to compete in that space?

If so, we might start building a set of APIs specifically for such prediction use cases, which can be optimized to the teeth. To provide some examples, we can bypass the JSON object, bypass the type dispatching, bypass the memory allocation in the predictor, specialize for dense data, specialize for balanced trees, etc. All of these are low-hanging fruit. If not, then we can leave the predictor focused on batch-based data.
Lastly, we have in-place prediction.
The xgboost Python package serializes NumPy arrays as JSON. This can take up a considerable amount of time in production workloads. This patch optimizes the specific case where the NumPy array is already in C-contiguous 32-bit floating-point format and can be loaded directly without the JSON layer. This can improve performance by up to 35% in some cases, as shown by the microbenchmark added in xgboost/tests/python/microbench_numpy.py.
The pull request contains updated tests as well.