
Pred_contribs in 2.1.1 takes significantly more GPU memory than in 1.4.2 #10936

Open
RostislavStoyanov opened this issue Oct 27, 2024 · 6 comments

Comments

@RostislavStoyanov

RostislavStoyanov commented Oct 27, 2024

I've noticed that using pred_contribs to generate SHAP values takes significantly more GPU memory in XGBoost 2.1.1 than in 1.4.2.
This can cause failures when generating SHAP values in workloads that previously ran without issues.

GPU memory comparison (peak memory.used in MiB, as reported by nvidia-smi):
1.4.2 - 3090
1.7.6 - 4214
2.1.1 - 5366

Short example used to demonstrate:

from typing import Tuple
import subprocess
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, DMatrix

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset 
    diabetes_binary = fetch_ucirepo(id=891) 
    # data (as pandas dataframes) 
    X = diabetes_binary.data.features 
    y = diabetes_binary.data.targets 
    return X, y

  
def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test 


def train_model(X_train: pd.DataFrame, y_train: pd.DataFrame) -> XGBClassifier:
    # train a model
    xgb_params = {
        "objective": "binary:logistic",
        "n_estimators": 2000,
        "max_depth": 13,
        "learning_rate": 0.1,
        "tree_method": "gpu_hist",
    }
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train["Diabetes_binary"])
    return model


def call_shap_log_usage(model: XGBClassifier, test_data: pd.DataFrame):
    # start the GPU memory logging script, redirecting its output to a file
    output_file = open("output.txt", mode='w')
    proc = subprocess.Popen('./log_usage.sh', stdout=output_file, stderr=subprocess.STDOUT)

    booster = model.get_booster()
    booster.set_param({"predictor": "gpu_predictor"})
    dmatrix = DMatrix(test_data)
    shap_values = booster.predict(dmatrix, pred_contribs=True)

    proc.terminate()
    output_file.close()

    return shap_values

if __name__ == '__main__':
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = train_model(X_train, y_train)
    call_shap_log_usage(model, X_test)

with the following bash script used to log GPU memory usage:

#!/bin/bash
# Poll nvidia-smi every 0.5 s and print the running maximum of memory.used (MiB)
a=0
while true; do
    b=$(nvidia-smi --query-gpu=memory.used --format=csv | grep -v memory | awk '{print $1}')
    [ "$b" -gt "$a" ] && a=$b && echo "$a"
    sleep .5
done

All tests were run on Ubuntu 20.04.6 LTS.
Requirements, with only the XGBoost version (and the device/tree-method parameters) changed between tests:

pandas==1.3.5
numpy==1.22.4
ipykernel==5.5.6
xgboost==1.4.2
scikit-learn==1.3.2
ucimlrepo==0.0.7
@trivialfis
Member

Thank you for raising the issue.

I just did a simple test. I think different models and imprecise measurement caused the change in observed output. I generated a sample model using the latest XGBoost and ran the SHAP prediction with both the latest and the 1.7 branches. The results from the two runs are consistent, with peak memory around 4.8-4.9 GB.

Following is a screenshot from Nsight Systems with the 1.7 branch:

[screenshot: Nsight Systems memory timeline, 1.7 branch]

@trivialfis
Member

Peak memory usage might not be captured by running nvidia-smi periodically. As shown in the screenshot, memory usage comes back down after a brief spike; one needs to capture that spike to measure the actual peak correctly.

[screenshot: Nsight Systems memory timeline showing the transient spike]
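For completeness, a minimal sketch of a higher-frequency sampler using the NVML Python bindings (pynvml, an assumed extra dependency not used elsewhere in this thread), reusing the booster and dmatrix from the reproduction script above. It polls device memory in a background thread and records the running peak around the prediction call; even this can miss very short-lived spikes, so a profiler such as Nsight Systems remains the more reliable measurement:

import threading
import time

import pynvml  # NVML Python bindings; an assumed extra dependency


def sample_peak_gpu_memory(stop_event: threading.Event, result: dict, interval_s: float = 0.001) -> None:
    """Poll device 0 at high frequency and record the peak memory.used in bytes."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    peak = 0
    while not stop_event.is_set():
        used = pynvml.nvmlDeviceGetMemoryInfo(handle).used
        peak = max(peak, used)
        time.sleep(interval_s)
    result["peak_bytes"] = peak
    pynvml.nvmlShutdown()


# Usage: start the sampler, run the prediction being measured, then stop the sampler.
stop = threading.Event()
result: dict = {}
sampler = threading.Thread(target=sample_peak_gpu_memory, args=(stop, result))
sampler.start()
shap_values = booster.predict(dmatrix, pred_contribs=True)  # booster/dmatrix from the script above
stop.set()
sampler.join()
print(f"peak GPU memory: {result['peak_bytes'] / 1024**2:.0f} MiB")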

@RostislavStoyanov
Author

Thank you for the answer. I will keep this in mind for any future cases.
I will be closing this issue. Once again, sorry for wasting your time and thank you.

@RostislavStoyanov
Author

RostislavStoyanov commented Nov 3, 2024

Hi again @trivialfis,

I've rerun the tests based on your feedback. I agree that there is no degradation between versions 1.7.6 and 2.1.2 of the library; however, I still find such a degradation between 1.4.2 and later versions.

What I have done is split the workflow into one script that trains and saves a model, and a second script that is profiled, which simply loads the saved model and calculates SHAP values. Here are the results, with both scripts provided below:

- 1.4.2 peak: [screenshot: 1_4_2_peak]
- 1.4.2 sustained: [screenshot: 1_4_2_sustained]

- 1.7.6 peak: [screenshot: 1_7_6_peak]
- 1.7.6 sustained: [screenshot: 1_7_6_sustained]

- 2.1.2 peak: [screenshot: 2_1_2_peak]
- 2.1.2 sustained: [screenshot: 2_1_2_sustained]

And here are the scripts:
Training:

from typing import Tuple
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, __version__ as xgb_version

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test


def train_save_model(X_train: pd.DataFrame, y_train: pd.DataFrame) -> XGBClassifier:
    # train a model
    xgb_params = {
        "objective": "binary:logistic",
        "n_estimators": 2000,
        "max_depth": 13,
        "learning_rate": 0.1,
        "tree_method": "gpu_hist",
    }
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train["Diabetes_binary"])
    model.save_model("xgb_model.json")
    return model

if __name__ == '__main__':
    if xgb_version != '1.4.2':
        print("Training only on 1.4.2.")
        exit(1)
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = train_save_model(X_train, y_train)

Shap calc:

from typing import Tuple
import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, DMatrix,  __version__ as xgb_version

def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    return X_train, X_test, y_train, y_test


def load_model(file_path: str) -> XGBClassifier:
    model = XGBClassifier()
    model.load_model(file_path)
    return model


def call_shap_values(model: XGBClassifier, test_data: pd.DataFrame) -> pd.DataFrame:
    booster = model.get_booster()
    booster.set_param({"predictor": "gpu_predictor"})
    dmatrix = DMatrix(test_data)
    shap_values = booster.predict(dmatrix, pred_contribs=True)

    shap_values_df = pd.DataFrame(shap_values[:, :-1], columns=test_data.columns)
    shap_values_df["base_value"] = shap_values[:, -1]
    shap_values_df.to_csv(f"shap_values_{xgb_version}.csv", index=False)

    return shap_values_df

if __name__ == '__main__':
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = load_model("./xgb_model.json")
    calc_save_shap_vals = call_shap_values(model, X_test)

As you can probably see from the screenshots, these tests were run on a Windows 10 machine, as I had some trouble running Nsight Systems on the remote instance I previously used.

I've found that this problem appears as early as version 1.5.0. Looking at the release notes, there is the following sentence: "Most of the other features, including prediction, SHAP value computation, feature importance, and model plotting were revised to natively handle categorical splits." Might this be the origin of the issue?

In any case, please let me know if there is an issue with the testing methodology; I think it is more precise this time.

As a side note, I have another question: isn't using Nsight Systems equivalent (for the purposes of memory usage measurement) to calling nvidia-smi at a high enough frequency (say 10 kHz) and logging the results?

@trivialfis
Member

Thank you for sharing the info and reminding me of the categorical feature support. Yes, I can confirm the memory usage increase, and it is indeed caused by the categorical support. Specifically, this member variable:

common::CatBitField categories;

It's used in the SHAP trace path; when the tree is deep, this causes a non-trivial amount of memory usage. We might want to make it optional.
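To get a rough sense of the scale, here is a back-of-envelope sketch. GPU SHAP roughly decomposes each tree into root-to-leaf paths, so a per-path-element field is paid once per node on every such path. The 16-byte per-element overhead below is an illustrative assumption (not the exact size of the XGBoost internals), and the complete-tree leaf count is an upper bound:

# Rough, illustrative estimate only; the real per-element cost depends on XGBoost internals.
n_trees = 2000
max_depth = 13

leaves_per_tree = 2 ** max_depth            # upper bound: real trees have fewer leaves
elements_per_path = max_depth + 1           # root-to-leaf path, inclusive
path_elements = n_trees * leaves_per_tree * elements_per_path

# Hypothetical fixed overhead per path element for the categorical bitfield; an assumption.
overhead_bytes_per_element = 16
extra_gib = path_elements * overhead_bytes_per_element / 1024**3
print(f"~{path_elements:.2e} path elements -> up to ~{extra_gib:.1f} GiB of extra memory")

Even though real trees have far fewer leaves than the upper bound, a deep, large ensemble easily turns a small per-element field into hundreds of megabytes or more, which is in the same ballpark as the observed increase.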

I have another question: isn't using Nsight Systems equivalent (for the purposes of memory usage measurement) to calling nvidia-smi at a high enough frequency (say 10 kHz) and logging the results?

Probably; the underlying mechanism of event sampling is beyond my knowledge.

@RostislavStoyanov
Author

Hi again @trivialfis,

So I've looked into your suggestion of making the categories field optional. I've done this using a raw pointer (which might not be the best approach, but more on this later). You can see my changes here. Looking at the results:

Version | Peak (GB) | Sustained (GB)
--- | --- | ---
1.4.2 | 3.64 | 2.84
2.1.3 | 5.39 | 4.18
dev | 4.91 | 3.84

it seems that this change is not enough. I think more modifications will have to be made, perhaps making some of the member variables in TreeView optional as well.

There are some things I would like to get your opinion on:

  1. I am not really sure how to manage memory in CUDA. I initially tried using std::unique_ptr, but I think that is only available on the host side. Do you have any recommendations on how this is best handled? Or, better yet, could you point me to some general CUDA resources to go through, as this is my first interaction with CUDA code?
  2. I want to look deeper into the memory reduction, but it seems it will be a bigger refactor than previously discussed. Do you think having more optional fields is acceptable? Or maybe you have suggestions on what to look at? Ideally, I would like to make changes that are useful for everybody and end up in the main repository, but I will very possibly need help at some points, and I am not sure if I can ask this of you.
