Pred_contribs in 2.1.1 takes significantly more gpu memory than in 1.4.2 #10936

RostislavStoyanov · 2024-10-27T18:37:27Z

I've noticed that using pred_contribs to generate shap values takes significantly more gpu memory in XGBoost 2.1.1 vs 1.4.2.
This can lead to having issues with generating shap values, where no issue was previously present.

GPU memory comparison:
1.4.2 - 3090
1.7.6 - 4214
2.1.1 - 5366

Short example used to demonstrate:

from typing import Tuple
import subprocess
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, DMatrix

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset 
    diabetes_binary = fetch_ucirepo(id=891) 
    # data (as pandas dataframes) 
    X = diabetes_binary.data.features 
    y = diabetes_binary.data.targets 
    return X, y

  
def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test 


def train_model(X_train: pd.DataFrame, y_train: pd.DataFrame) -> XGBClassifier:
    # train a model
    xgb_params = {
        "objective": "binary:logistic",
        "n_estimators": 2000,
        "max_depth": 13,
        "learning_rate": 0.1,
        "tree_method": "gpu_hist",
    }
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train["Diabetes_binary"])
    return model


def call_shap_log_usage(model: XGBClassifier, test_data: pd.DataFrame) -> None:
    output_file = open("output.txt", mode='w')
    proc = subprocess.Popen('./log_usage.sh', stdout=output_file, stderr=subprocess.STDOUT)

    booster = model.get_booster()
    booster.set_param({"predictor": "gpu_predictor"})
    dmatrix = DMatrix(test_data)
    shap_values = booster.predict(dmatrix, pred_contribs=True)

    proc.terminate()

    return shap_values

if __name__ == '__main__':
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = train_model(X_train, y_train)
    call_shap_log_usage(model, X_test)

with the following bash script used for generating memory usage:

#!/bin/bash
a=0
while true; do 
b=$(nvidia-smi --query-gpu=memory.used --format=csv|grep -v memory|awk '{print $1}')
[ $b -gt $a ] && a=$b && echo $a 
sleep .5
done

All tests run on Ubuntu 20.04.6 LTS.
Requirements with only the xgb version (and the device/tree method parameters) being changed between tests:

pandas==1.3.5
numpy==1.22.4
ipykernel==5.5.6
xgboost==1.4.2
scikit-learn==1.3.2
ucimlrepo==0.0.7

The text was updated successfully, but these errors were encountered:

trivialfis · 2024-10-28T07:14:58Z

Thank you for raising the issue.

I just did a simple test. I think different models and imprecise measuring caused the change in observed output. I generated a sample model using the latest XGBoost and ran the SHAP prediction using the latest and the 1.7 branches. The results from both runs are consistent, with peak memory around 4.8-4.9GB.

Following is the screenshot of nsight-system with the 1.7 branch:

trivialfis · 2024-10-28T07:17:09Z

A peak memory usage might not be captured by running nvidia-smi periodically. As shown in the screenshot, the memory usage came back down after a spike. One needs to capture that spike to see the actual memory usage correctly.

RostislavStoyanov · 2024-10-28T08:54:50Z

Thank you for the answer. I will keep this in mind for any future cases.
I will be closing this issue. Once again, sorry for wasting your time and thank you.

RostislavStoyanov · 2024-11-03T11:47:40Z

Hi again @trivialfis,

I've rerun the tests based upon your feedback. I do agree that there is no degradation between 1.7.6 and 2.1.2 versions of the library, however, I still do find such degradation between 1.4.2 and later versions.

What I have done is to have a script for training and saving a model and then another script that is profiled, which simply loads the saved model and calculates shap values. Here are the results with scripts also provided below:

-1.4.2 peak:

-1.4.2 sustained:

-1.7.6 peak:

-1.7.6 sustained:

-2.1.2 peak:

-2.1.2 sustained:

And here are the scripts:
Training:

from typing import Tuple
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, __version__ as xgb_version

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test


def train_save_model(X_train: pd.DataFrame, y_train: pd.DataFrame) -> XGBClassifier:
    # train a model
    xgb_params = {
        "objective": "binary:logistic",
        "n_estimators": 2000,
        "max_depth": 13,
        "learning_rate": 0.1,
        "tree_method": "gpu_hist",
    }
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train["Diabetes_binary"])
    model.save_model("xgb_model.json")
    return model

if __name__ == '__main__':
    if xgb_version != '1.4.2':
        print("Training only on 1.4.2.")
        exit(1)
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = train_save_model(X_train, y_train)

Shap calc:

from typing import Tuple
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, DMatrix,  __version__ as xgb_version

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test


def load_model(file_path: str) -> XGBClassifier:
    model = XGBClassifier()
    model.load_model(file_path)
    return model


def call_shap_values(model: XGBClassifier, test_data: pd.DataFrame) -> pd.DataFrame:
    booster = model.get_booster()
    booster.set_param({"predictor": "gpu_predictor"})
    dmatrix = DMatrix(test_data)
    shap_values = booster.predict(dmatrix, pred_contribs=True)

    shap_values_df = pd.DataFrame(shap_values[:, :-1], columns=test_data.columns)
    shap_values_df["base_value"] = shap_values[:, -1]
    shap_values_df.to_csv(f"shap_values_{xgb_version}.csv", index=False)

    return shap_values_df

if __name__ == '__main__':
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = load_model("./xgb_model.json")
    calc_save_shap_vals = call_shap_values(model, X_test)

As you can probably see from the screenshots these tests were run against a Win10 machine, as I had some trouble running nsight-system on the remote instance I previously used.

I've found out that this problem appears from as soon as the 1.5.0 version. Looking at the patch notes there is the following sentence "Most of the other features, including prediction, SHAP value computation, feature
importance, and model plotting were revised to natively handle categorical splits.", which might be the origin of the issue?

In any case, please inform me if there is an issue with the testing methodology, I think it is more precise this time.

As a side note, I have another question -- isn't using nsight-system equilvalent (for the purposes of memory usage measurement) to calling nvidia-smi with high enough frequency (like 10kHZ) and logging the results?

trivialfis · 2024-11-06T11:04:43Z

Thank you for sharing the info and reminding me of the categorical feature support. Yes, I can confirm the memory usage increase and it is indeed caused by categorical support. Specifically, this member variable

xgboost/src/predictor/gpu_predictor.cu

Line 427 in 197c0ae

common::CatBitField categories;

It's used in the SHAP trace path, when the tree is deep, this caused a non-trivial amount of memory usage. We might want to make this optional.

I have another question -- isn't using nsight-system equilvalent (for the purposes of memory usage measurement) to calling nvidia-smi with high enough frequency (like 10kHZ) and logging the results?

Probably, the underlying mechanism of event sampling is beyond my knowledge.

RostislavStoyanov · 2024-12-15T19:51:42Z

Hi again @trivialfis,

So I've looked into your suggestion for making the categories field optional. I've done this using a raw pointer (which might not be the best approach but more on this later). You can see my changes here. Looking at the results:

	peak(Gb)	sustained(Gb)
1.4.2	3.64	2.84
2.1.3	5.39	4.18
dev	4.91	3.84

it seems that this change is not enough. I think more modifications will have to be make like perhaps making some of the member variables in TreeView optional.

There are some things I would like to get your opinion on:

I am not really sure how to manage memory in cuda. I initially tried using std::unique_ptr, but that is only available on host side I think. Do you have any recommedations as how this is best handled? Or better yet maybe you could point me to some general CUDA resources to go through as this is my first interaction with cuda code.
I want to look deeper into the memory reduction, but it seems it will be a bigger refactor that previously discussed. Do you think having more optional fields is acceptable? Or maybe some suggestions at what to look at? Ideally, I would like to be able to make some changes that are useful for everybody end up in the main repository, but will very possibly need help at some points, and I am not sure if I can ask this of you.

RostislavStoyanov closed this as completed Oct 28, 2024

RostislavStoyanov reopened this Nov 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Pred_contribs in 2.1.1 takes significantly more gpu memory than in 1.4.2 #10936

Pred_contribs in 2.1.1 takes significantly more gpu memory than in 1.4.2 #10936

RostislavStoyanov commented Oct 27, 2024 •

edited

Loading

trivialfis commented Oct 28, 2024

trivialfis commented Oct 28, 2024

RostislavStoyanov commented Oct 28, 2024

RostislavStoyanov commented Nov 3, 2024 •

edited by trivialfis

Loading

trivialfis commented Nov 6, 2024

RostislavStoyanov commented Dec 15, 2024

Pred_contribs in 2.1.1 takes significantly more gpu memory than in 1.4.2 #10936

Pred_contribs in 2.1.1 takes significantly more gpu memory than in 1.4.2 #10936

Comments

RostislavStoyanov commented Oct 27, 2024 • edited Loading

trivialfis commented Oct 28, 2024

trivialfis commented Oct 28, 2024

RostislavStoyanov commented Oct 28, 2024

RostislavStoyanov commented Nov 3, 2024 • edited by trivialfis Loading

trivialfis commented Nov 6, 2024

RostislavStoyanov commented Dec 15, 2024

RostislavStoyanov commented Oct 27, 2024 •

edited

Loading

RostislavStoyanov commented Nov 3, 2024 •

edited by trivialfis

Loading