Improve DMatrix creation performance in python #10407

arieleiz · 2024-06-10T20:48:44Z

The xgboost Python package serializes numpy arrays as JSON. This can take a considerable amount of time in production workloads. This patch optimizes the specific case where the numpy array is already in "C" contiguous 32-bit floating-point format and can be loaded directly without the JSON layer. This can improve performance by up to 35% in some cases, as shown by the microbenchmark added in xgboost/tests/python/microbench_numpy.py:

Rows     | Cols     | Threads      | Contiguous      | Non-contiguous  | Ratio
---------+----------+--------------+-----------------+-----------------+--------------
   15000 |      100 |            0 |         0.01686 |         0.01988 |        84.8%
   15000 |      100 |            1 |         0.02897 |         0.04424 |        65.5%
   15000 |      100 |            2 |         0.02579 |          0.0392 |        65.8%
   15000 |      100 |           10 |         0.01581 |         0.02058 |        76.8%
---------+----------+--------------+-----------------+-----------------+--------------
       2 |     2000 |            0 |        0.001055 |        0.001205 |        87.6%
       2 |     2000 |            1 |       0.0004465 |       0.0005689 |        78.5%
       2 |     2000 |            2 |       0.0004609 |        0.000615 |        74.9%
       2 |     2000 |           10 |       0.0005087 |       0.0005623 |        90.5%
---------+----------+--------------+-----------------+-----------------+--------------

The pull request contains updated tests as well.

hcho3 · 2024-06-10T20:55:14Z

@arieleiz The current interface uses NumPy's __array_interface__, which should be equivalent to passing the pointer handle. See https://numpy.org/doc/stable/reference/arrays.interface.html. The content of the matrix is not being copied or serialized; only the memory address gets copied. I'm not sure where the 35% difference is coming from.

@trivialfis Are you aware of the performance implications of the use of __array_interface__ ? Or it might be that the JSON parser is introducing significant overhead.
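
For reference, here is a minimal sketch of what "only the memory address gets copied" means in practice. It uses only NumPy and the standard library; the helper name `interface_json` is made up for illustration and is not part of xgboost, and the string it builds is only roughly the kind of metadata the Python package hands to the native layer.

```python
import json
import numpy as np


def interface_json(arr: np.ndarray) -> str:
    """Serialize the array's metadata (pointer, shape, dtype), not its data.

    Hypothetical helper for illustration only.
    """
    return json.dumps(arr.__array_interface__)


small = np.zeros((2, 1500), dtype=np.float32)
large = np.zeros((15000, 100), dtype=np.float32)

meta = json.loads(interface_json(small))
# The "data" entry holds the raw buffer address and a read-only flag.
print(meta["shape"], meta["typestr"], meta["data"])

# The JSON string stays tiny no matter how many elements the array has.
print(len(interface_json(small)), len(interface_json(large)))
```

The size of the JSON payload is fixed, so any cost it adds is a constant per call, which would matter most for very small matrices.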

arieleiz · 2024-06-10T21:20:55Z

Hi @hcho3 !

You are of course correct, I did not describe the issue correctly, and the microbenchmark attached has side-effects that make the results incorrect.

After fixing the microbenchmark so that the data layout does not change, so we compare apples-to-apples:
a. there is no significant change when data sizes are very large (not our production use cases)
b. for smaller data (2 rows of 1500 cols, simulating what we have in production), we see a consistent 22% improvement in 1 thread and 50% improvement in 2 threads.

Analyzing (b) using a python+native profiler, we see the improvement comes directly from _from_numpy_array() and, digging deeper, from the use of DenseAdapterBatch vs. ArrayAdapterBatch.

Another small difference comes, as you suggest, from the fact that JSON is used neither for the array interface nor for the arguments (missing/nthread/data_split_mode).

If you are OK with the change in general, I'll update the commit message and fix the microbenchmark.

hcho3 · 2024-06-10T21:24:31Z

Yes, please fix the benchmark. I will defer to @trivialfis to decide whether it's worth having a separate code path to optimize for a specific use case (small matrices).

arieleiz · 2024-06-10T21:52:29Z

@hcho3

Here are the updated numbers comparing apples-to-apples. (the test is repeated 65536//rows times so the test durations are non-trivial).
I've updated the code to do the optimizations if the total data size is <= 32768 floats.
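
As a rough sketch of the condition described above (not the code in the patch), the eligibility check could look like the following. The function and constant names are hypothetical, and the `ndim == 2` requirement is an assumption added for the example.

```python
import numpy as np

# Threshold quoted in the comment above: total number of floats.
MAX_FAST_PATH_ELEMENTS = 32768


def is_fast_path_candidate(arr) -> bool:
    """Return True if `arr` could be loaded without a conversion step.

    Hypothetical helper: C-contiguous, float32, and small enough.
    """
    return (
        isinstance(arr, np.ndarray)
        and arr.ndim == 2                      # assumption: 2-D matrices only
        and arr.dtype == np.float32
        and arr.flags.c_contiguous
        and arr.size <= MAX_FAST_PATH_ELEMENTS
    )


print(is_fast_path_candidate(np.zeros((2, 1500), dtype=np.float32)))
print(is_fast_path_candidate(np.zeros((2, 1500), dtype=np.float64)))
```

Anything that fails the check would simply fall through to the existing code path.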

Threads  | Rows     | Cols     | Current (sec)   | Optimized (sec) | Ratio
       1 |        1 |     1000 |       0.0001921 |       0.0001703 |        88.6%
       1 |        4 |     1000 |       0.0001689 |       0.0001437 |        85.1%
       1 |       16 |     1000 |       0.0002639 |       0.0002457 |        93.1%
       1 |       64 |     1000 |       0.0006843 |       0.0006719 |        98.2%
       1 |      256 |     1000 |        0.002611 |        0.002655 |       101.7%
       1 |     1024 |     1000 |           0.013 |          0.0126 |        97.0%
       1 |     4096 |     1000 |         0.06081 |          0.0593 |        97.5%
       1 |    16384 |     1000 |          0.2981 |          0.2974 |        99.8%
       2 |        1 |     1000 |       0.0001415 |       0.0001196 |        84.6%
       2 |        4 |     1000 |       0.0002155 |       0.0002003 |        93.0%
       2 |       16 |     1000 |       0.0002137 |        0.000196 |        91.7%
       2 |       64 |     1000 |       0.0005054 |       0.0004855 |        96.1%
       2 |      256 |     1000 |        0.001613 |        0.001687 |       104.6%
       2 |     1024 |     1000 |        0.007743 |        0.008194 |       105.8%
       2 |     4096 |     1000 |         0.03791 |         0.03783 |        99.8%
       2 |    16384 |     1000 |          0.2077 |          0.2037 |        98.1%
       4 |        1 |     1000 |       0.0001374 |       0.0001237 |        90.0%
       4 |        4 |     1000 |       0.0001985 |       0.0001621 |        81.7%
       4 |       16 |     1000 |       0.0002266 |       0.0001988 |        87.7%
       4 |       64 |     1000 |       0.0005175 |       0.0004775 |        92.3%
       4 |      256 |     1000 |         0.00166 |        0.001594 |        96.0%
       4 |     1024 |     1000 |        0.008257 |        0.008097 |        98.1%
       4 |     4096 |     1000 |         0.03492 |          0.0354 |       101.4%
       4 |    16384 |     1000 |          0.1896 |          0.1897 |       100.0%
       8 |        1 |     1000 |       0.0001471 |       0.0001254 |        85.3%
       8 |        4 |     1000 |       0.0003609 |        0.000326 |        90.4%
       8 |       16 |     1000 |       0.0002651 |       0.0002217 |        83.6%
       8 |       64 |     1000 |       0.0003504 |       0.0003064 |        87.5%
       8 |      256 |     1000 |       0.0008264 |       0.0008729 |       105.6%
       8 |     1024 |     1000 |        0.003367 |        0.003127 |        92.9%
       8 |     4096 |     1000 |         0.01932 |         0.01799 |        93.1%
       8 |    16384 |     1000 |          0.1245 |          0.1208 |        97.0%
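
The repetition scheme mentioned above (65536//rows repeats) could be sketched roughly as below. This is not the contents of microbench_numpy.py: the function names are made up, the timed workload is a stand-in callable rather than DMatrix construction, and it assumes the reported times are per-call averages. The Ratio column appears to be Optimized / Current, so values under 100% mean the optimized path was faster.

```python
import time
import numpy as np

TOTAL_ROWS = 65536  # repeats = TOTAL_ROWS // rows, as described above


def time_per_call(fn, rows: int, cols: int = 1000) -> float:
    """Average wall-clock seconds per call of fn(data) on a rows x cols array."""
    rng = np.random.default_rng(0)
    data = rng.random((rows, cols), dtype=np.float32)
    repeats = max(1, TOTAL_ROWS // rows)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(data)
    return (time.perf_counter() - start) / repeats


def ratio(optimized_sec: float, current_sec: float) -> float:
    """Optimized time as a fraction of current time (below 1.0 is faster)."""
    return optimized_sec / current_sec


# Stand-in workload; a real benchmark would construct a DMatrix here.
elapsed = time_per_call(lambda a: a.sum(), rows=1024, cols=10)
print(f"{elapsed:.3g} sec per call")
print(f"{ratio(0.0001703, 0.0001921):.1%}")
```

Because the printed times are rounded, recomputing a ratio from the table can differ from the listed value in the last digit.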

The xgboost Python package serializes numpy arrays as JSON. This has non-trivial overhead for small datasets. This patch optimizes the specific case where the numpy array is already in "C" contiguous 32-bit floating-point format and has rows*cols <= 32768, and loads it directly without the JSON layer. The results from xgboost/tests/python/microbench_numpy.py are identical to the table above.

trivialfis · 2024-06-11T07:07:52Z

I don't mind the diverging code path in general. Many users have asked for a more robust implementation of streaming-based prediction. The question for me is whether the code is divergent enough. For instance, there are dedicated inference libraries with special optimizations for getting faster inference with tree models in general, like the FIL project in cuML. They work on both random forest and boosted trees and potentially many other types of tree-based models.

Does XGBoost want to compete in that space? If so, we might start building a set of APIs specifically for such prediction use cases, which can be optimized to the teeth. To provide some examples, we can bypass the JSON object, bypass the type dispatching, bypass the memory allocation in the predictor, specialize for dense data, specialize for balanced trees, etc. All of these are low-hanging fruit.

If not, then we can leave the predictor focusing on batch-based data.

trivialfis · 2024-06-11T07:16:17Z

Lastly, we have in-place prediction: Booster.inplace_predict doesn't require the construction of a DMatrix.

arieleiz force-pushed the master branch from 2e3adc3 to 3ee4695 Compare June 10, 2024 21:49

arieleiz force-pushed the master branch from 3ee4695 to cdac476 Compare June 11, 2024 06:12
