XGBoost Pipelines

Training pipeline

The XGBoost training pipeline can be found in training/pipeline.py. Within the Kubeflow pipeline, train_xgboost_model is the main training component; it contains the implementation of an XGBoost model with scikit-learn preprocessing. This component can then be wrapped in a custom kfp ContainerOp from google-cloud-pipeline-components, which submits a Vertex Training job and adds flexibility for machine_type, replica_count and accelerator_type, among other machine configurations.
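
As an illustration of this wrapping step, the sketch below assumes the v1 API of google-cloud-pipeline-components; the import path for train_xgboost_model and the machine configuration values are placeholders, not the repository's actual settings.

```python
# Sketch only: wrapping the training component as a Vertex AI custom training
# job. Import paths and machine settings are assumptions, not the repo's code.
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

# Hypothetical import path for the component defined in this repository
from pipelines.components import train_xgboost_model

custom_train_op = create_custom_training_job_from_component(
    train_xgboost_model,
    display_name="train-xgboost-model",
    machine_type="n1-standard-8",        # CPU/RAM for each worker
    replica_count=1,                     # single-node training
    accelerator_type="NVIDIA_TESLA_T4",  # optional GPU
    accelerator_count=1,
)

# Inside the pipeline definition, the wrapped op is called exactly like the
# original component, e.g.:
# train_task = custom_train_op(training_data=..., model_params=...)
```

Running the component as a Vertex custom training job lets the training step request larger machines or accelerators than the default pipeline workers.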

The training phase is preceded by a preprocessing phase, where different transformations are applied to the training and evaluation data using scikit-learn preprocessing functions. The preprocessing step and the training step define the two components of the scikit-learn pipeline, as shown in the diagram below.

Training process

Preprocessing with Scikit-learn

The three data transformation steps applied in the train.py script are:

| Encoder | Description | Features |
| --- | --- | --- |
| StandardScaler() | Centering and scaling numerical values | dayofweek, hourofday, trip_distance, trip_miles, trip_seconds |
| OneHotEncoder() | Encoding a chosen subset of categorical features as a one-hot numeric array. New/unknown values in categorical features are represented as zeroes everywhere in the one-hot numeric array. | payment_type |
| OrdinalEncoder() | Encoding a chosen subset of categorical features as an integer array. New/unknown values in categorical features are assigned an integer equal to the number of categories for that feature in the training set. | company |

More preprocessing steps can be added to the pipeline. For more details, see the official documentation. Ensure that any additional preprocessing steps can handle new/unknown values in the test data.
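
As a rough sketch of how these three transformations can be composed, assuming scikit-learn's ColumnTransformer and Pipeline (the actual train.py may structure this differently, and the OrdinalEncoder unknown-value handling below is only an approximation of the behaviour described in the table):

```python
# Sketch of the preprocessing stage, assuming scikit-learn's ColumnTransformer.
# Column names follow the Chicago taxi example above.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from xgboost import XGBRegressor

NUM_COLS = ["dayofweek", "hourofday", "trip_distance", "trip_miles", "trip_seconds"]
OHE_COLS = ["payment_type"]
ORD_COLS = ["company"]

preprocesser = ColumnTransformer(
    transformers=[
        # Centre and scale numerical features
        ("num", StandardScaler(), NUM_COLS),
        # Unknown categories at prediction time become an all-zero one-hot row
        ("ohe", OneHotEncoder(handle_unknown="ignore"), OHE_COLS),
        # Unknown categories are mapped to a sentinel integer here; the actual
        # pipeline maps them to the number of categories seen in training
        ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ORD_COLS),
    ]
)

# Preprocessing + model form the two steps of the scikit-learn pipeline
sklearn_pipeline = Pipeline(
    steps=[
        ("preprocess", preprocesser),
        ("model", XGBRegressor()),
    ]
)
```

Keeping the preprocessing inside the same scikit-learn pipeline as the model ensures that exactly the same transformations are applied at training and at prediction time.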

The XGBoost Model

In our example implementation, we have a regression problem: predicting the total fare of a taxi trip in Chicago. We therefore use XGBRegressor, whose hyperparameters are defined in the variable model_params in the file training/pipeline.py.

Model Hyperparameters

You can specify different hyperparameters through the model_params argument of train_xgboost_model, including:

  • booster: the type of booster (the tree-based booster gbtree is used by default).
  • max_depth: the maximum depth of each tree.
  • objective: equivalent to the loss function (squared loss, reg:squarederror, is the default).
  • min_split_loss: the minimum loss reduction required to make a further partition on a leaf node of the tree.

More hyperparameters can be used to customize your training. For more details, consult the XGBoost documentation.
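
For illustration, a hypothetical model_params dictionary covering the hyperparameters listed above could look like this (the actual defaults are defined in training/pipeline.py):

```python
# Hypothetical example of model_params passed to train_xgboost_model;
# the actual values are defined in training/pipeline.py.
model_params = {
    "booster": "gbtree",              # tree-based booster (default)
    "max_depth": 6,                   # maximum depth of each tree
    "objective": "reg:squarederror",  # squared loss for regression
    "min_split_loss": 0.0,            # a.k.a. gamma: min loss reduction to split a leaf
    "n_estimators": 200,              # number of boosting rounds
    "learning_rate": 0.3,
}

# Inside the training component these are unpacked into the regressor:
# model = XGBRegressor(**model_params)
```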

Model artifacts

Two model artifacts are generated when we run the training job:

  • model.joblib: the trained model, exported to GCS as a joblib object.
  • eval_result: the evaluation metrics, exported to GCS as a JSON file.

Diagram: XGBoost component with model and metrics artifacts
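
The sketch below shows one way the two artifacts could be written out at the end of the training component; the exact file names and output handling in the repository may differ.

```python
# Sketch only: exporting the two training artifacts. In a KFP component the
# output paths would normally come from Output[Model] / Output[Metrics]
# artifact URIs (e.g. a /gcs/... path), not hard-coded strings.
import json
import joblib


def save_artifacts(sklearn_pipeline, eval_result: dict, model_dir: str) -> None:
    # model.joblib: the fitted scikit-learn pipeline (preprocessing + XGBRegressor)
    joblib.dump(sklearn_pipeline, f"{model_dir}/model.joblib")

    # eval_result: evaluation metrics serialised to a JSON file
    with open(f"{model_dir}/eval_result.json", "w") as f:
        json.dump(eval_result, f)
```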

Model test/evaluation

Once the model is trained, it is used to generate challenger predictions for evaluation purposes. By default, the pipeline uses the predict_xgboost_model component, which expects a single CSV file of test data to create predictions. However, if you are working with larger test data, it is more efficient to replace it with the Google prebuilt component ModelBatchPredictOp, to avoid crashes caused by memory overload.
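
If you swap in the prebuilt component, the call could look roughly like the sketch below; argument names assume the v1 release of google-cloud-pipeline-components, and the project, bucket and model references are placeholders.

```python
# Sketch only: replacing the in-memory prediction component with the prebuilt
# ModelBatchPredictOp inside the pipeline definition. Project, bucket and
# model references are placeholders; `model_upload_task` stands for an earlier
# step that produced a Vertex Model artifact (e.g. a ModelUploadOp).
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp

batch_predict_task = ModelBatchPredictOp(
    project="my-gcp-project",
    location="europe-west2",
    job_display_name="xgboost-test-predictions",
    model=model_upload_task.outputs["model"],
    instances_format="csv",
    gcs_source_uris=["gs://my-bucket/test_data/test.csv"],
    predictions_format="jsonl",
    gcs_destination_output_uri_prefix="gs://my-bucket/test_predictions/",
    machine_type="n1-standard-4",
    starting_replica_count=1,
    max_replica_count=2,
)
```

Because the batch prediction job runs as a managed Vertex AI job, the test data is streamed by the service rather than loaded into the pipeline component's memory.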

Prediction pipeline

The XGBoost prediction pipeline can be found in prediction/pipeline.py.

Specifically, it starts by extracting data from BigQuery to a CSV file in Google Cloud Storage, followed by a data skew validation, which feeds this CSV file into the generate_statistics component and compares the resulting statistics with the schema in the assets folder. Before calling the batch prediction function, the pipeline looks up the champion model among all the trained models. Finally, it takes input data from BigQuery for batch prediction and outputs a BigQuery table named prediction_<model-display-name>_<job-create-time>.
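
To make the flow concrete, here is a heavily simplified, hypothetical skeleton of such a pipeline. Only ModelBatchPredictOp is a real prebuilt component; the @dsl.component stubs and parameter names are placeholders for the actual implementations in prediction/pipeline.py.

```python
# Hypothetical skeleton of the prediction pipeline flow described above,
# assuming KFP v2 and google-cloud-pipeline-components v1. The stub components
# only illustrate the ordering of the steps.
from kfp.v2 import dsl
from kfp.v2.dsl import Artifact, Dataset, Input, Output
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp


@dsl.component
def extract_bq_to_gcs(bq_table: str, csv_file: Output[Dataset]):
    """Placeholder: export the BigQuery source table to a CSV file on GCS."""


@dsl.component
def validate_skew(csv_file: Input[Dataset], schema_path: str):
    """Placeholder: generate statistics from the CSV and compare with the schema."""


@dsl.component
def lookup_champion_model(model_display_name: str, model: Output[Artifact]):
    """Placeholder: fetch the champion model as a Vertex Model artifact."""


@dsl.pipeline(name="xgboost-prediction-pipeline")
def xgboost_prediction_pipeline(
    project: str,
    location: str,
    bq_source_uri: str,
    bq_destination_uri: str,
    model_display_name: str,
):
    extract_task = extract_bq_to_gcs(bq_table=bq_source_uri)
    skew_task = validate_skew(
        csv_file=extract_task.outputs["csv_file"],
        schema_path="assets/training_schema.pbtxt",  # placeholder schema path
    )
    champion_task = lookup_champion_model(model_display_name=model_display_name)

    # Batch prediction reads from BigQuery and writes a BigQuery table named
    # prediction_<model-display-name>_<job-create-time>
    ModelBatchPredictOp(
        project=project,
        location=location,
        job_display_name="xgboost-batch-prediction",
        model=champion_task.outputs["model"],
        instances_format="bigquery",
        bigquery_source_input_uri=bq_source_uri,
        predictions_format="bigquery",
        bigquery_destination_output_uri=bq_destination_uri,
    ).after(skew_task)
```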