The XGBoost training pipeline can be found in `training/pipeline.py`. Within the Kubeflow pipeline, `train_xgboost_model` is the main training component; it contains the implementation of an XGBoost model with scikit-learn preprocessing. This component can then be wrapped in a custom kfp ContainerOp using `google-cloud-pipeline-components`, which submits a Vertex Training job with added flexibility for `machine_type`, `replica_count`, and `accelerator_type`, among other machine configurations.
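The exact wrapper is defined in the repository, but as a rough, illustrative sketch (assuming the `create_custom_training_job_from_component` helper from `google-cloud-pipeline-components` and placeholder machine settings), the wrapping could look like this:

```python
# Hedged sketch: wrapping the KFP training component as a Vertex Training job.
# The helper function and all machine settings below are assumptions, not the
# repository's exact configuration.
from google_cloud_pipeline_components.v1.custom_job import (
    create_custom_training_job_from_component,
)

custom_train_job = create_custom_training_job_from_component(
    train_xgboost_model,                 # the KFP training component
    display_name="train-xgboost-model",  # placeholder job name
    machine_type="n1-standard-8",        # placeholder machine type
    replica_count=1,
    accelerator_type="NVIDIA_TESLA_T4",  # optional accelerator (placeholder)
    accelerator_count=1,
)

# Inside a @kfp.dsl.pipeline function, call custom_train_job(...) in place of
# train_xgboost_model(...) to run the step as a Vertex Training job.
```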
The training phase is preceded by a preprocessing phase where different transformations are applied to the training and evaluation data using scikit-learn preprocessing functions. The preprocessing step and the training step define the two components of the Scikit-Learn pipeline as shown in the diagram below.
The three data transformation steps applied in the `train.py` script are:
| Encoder | Description | Features and handling of unknown values |
|---|---|---|
| `StandardScaler()` | Centering and scaling numerical values | `dayofweek`, `hourofday`, `trip_distance`, `trip_miles`, `trip_seconds` |
| `OneHotEncoder()` | Encoding a chosen subset of categorical features as a one-hot numeric array | `payment_type`; new/unknown values in categorical features are represented as zeroes everywhere in the one-hot numeric array |
| `OrdinalEncoder()` | Encoding a chosen subset of categorical features as an integer array | `company`; new/unknown values in categorical features are assigned an integer equal to the number of categories for that feature in the training set |
More preprocessing steps can be added to the pipeline. For more details, see the official documentation. Ensure that any additional preprocessing steps can handle new/unknown values in the test data.
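As a minimal, illustrative sketch (not the repository's exact code), the three transformers from the table above can be combined with the model into a single scikit-learn pipeline. The column lists match the table, but the handling of unknown ordinal values is simplified to a fixed placeholder:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from xgboost import XGBRegressor

NUM_COLS = ["dayofweek", "hourofday", "trip_distance", "trip_miles", "trip_seconds"]
OHE_COLS = ["payment_type"]
ORD_COLS = ["company"]

preprocessor = ColumnTransformer(
    transformers=[
        # Center and scale the numerical features.
        ("num", StandardScaler(), NUM_COLS),
        # Unknown categories at prediction time become an all-zero one-hot vector.
        ("ohe", OneHotEncoder(handle_unknown="ignore"), OHE_COLS),
        # The description above maps unknown categories to the number of training
        # categories; this sketch uses a fixed placeholder value (-1) instead.
        ("ord", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ORD_COLS),
    ]
)

pipeline = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("model", XGBRegressor(objective="reg:squarederror")),
    ]
)
```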
In our example implementation, we have a regression problem: predicting the total fare of a taxi trip in Chicago. We therefore use `XGBRegressor`, whose hyperparameters are defined in the `model_params` variable in `training/pipeline.py`. You can specify different hyperparameters through the `model_params` argument of `train_xgboost_model`, including:
- `booster`: the type of booster (`gbtree`, a tree-based booster, is used by default).
- `max_depth`: the maximum depth of each tree.
- `objective`: equivalent to the loss function (squared error, `reg:squarederror`, is the default).
- `min_split_loss`: the minimum loss reduction required to make a further partition on a leaf node of the tree.
More hyperparameters can be used to customize your training. For more details, consult the XGBoost documentation.
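For illustration only (the exact format accepted by `train_xgboost_model` is defined in the repository), a `model_params` value covering the hyperparameters listed above could look like this; the values are generic defaults, not tuned settings:

```python
# Illustrative XGBoost hyperparameters; values are placeholders, not the
# repository's tuned configuration.
model_params = {
    "booster": "gbtree",
    "max_depth": 6,
    "objective": "reg:squarederror",
    "min_split_loss": 0.0,
    "learning_rate": 0.3,
    "n_estimators": 200,
}
```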
Two model artifacts are generated when we run the training job:

- `model.joblib`: the model, exported to GCS as a joblib object.
- `eval_result`: the evaluation metrics, exported to GCS as a JSON file.
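As a rough sketch of how these artifacts might be produced (file paths and metric names are placeholders; in the pipeline the component writes to its GCS-backed output locations rather than the local working directory):

```python
import json
import joblib

# Persist the fitted scikit-learn/XGBoost pipeline as a joblib object.
joblib.dump(pipeline, "model.joblib")

# Persist the evaluation metrics as JSON; metric names and values are placeholders.
eval_result = {"rmse": 0.0, "mae": 0.0}
with open("eval_result.json", "w") as f:
    json.dump(eval_result, f)
```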
Once the model is trained, it is used to produce challenger predictions for evaluation purposes. By default, the pipeline implements the `predict_tensorflow_model` component, which expects a single CSV file from which to create predictions for the test data. However, if you are working with larger test data, it is more efficient to replace it with the Google prebuilt component `ModelBatchPredictOp` to avoid crashes caused by running out of memory.
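A hedged sketch of swapping in the prebuilt component (project, region, URIs, and the upstream output name are all placeholders) might look like this inside the pipeline definition:

```python
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp

# Assumes `training_task` is an upstream step that outputs a Vertex Model
# resource under the output name "model".
batch_predict_task = ModelBatchPredictOp(
    project="my-gcp-project",                    # placeholder project id
    location="europe-west2",                     # placeholder region
    job_display_name="xgboost-challenger-predictions",
    model=training_task.outputs["model"],
    instances_format="csv",
    gcs_source_uris=["gs://my-bucket/test-data/*.csv"],              # placeholder
    predictions_format="jsonl",
    gcs_destination_output_uri_prefix="gs://my-bucket/predictions",  # placeholder
    machine_type="n1-standard-4",
)
```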
The XGBoost prediction pipeline can be found in `prediction/pipeline.py`. It starts by extracting data from BigQuery to a CSV file in Google Cloud Storage, followed by a data skew validation step, which feeds this CSV file to the `generate_statistics` component and compares the resulting statistics with the schema stored in the `assets` folder.
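The repository implements this check with its own components, but conceptually it resembles the following TensorFlow Data Validation sketch (paths are placeholders and the components' exact logic may differ):

```python
import tensorflow_data_validation as tfdv

# Compute statistics over the CSV extracted from BigQuery (placeholder path).
stats = tfdv.generate_statistics_from_csv("gs://my-bucket/prediction-data/data.csv")

# Load the training-time schema from the assets folder (placeholder file name)
# and validate the serving statistics against it.
schema = tfdv.load_schema_text("assets/tfdv_schema.pbtxt")
anomalies = tfdv.validate_statistics(statistics=stats, schema=schema)

if anomalies.anomaly_info:
    raise ValueError(f"Data skew detected: {anomalies.anomaly_info}")
```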
Before calling the batch prediction function, the pipeline looks for the champion model among all the trained models. It then takes input data from BigQuery for batch prediction and outputs a BigQuery table named `prediction_<model-display-name>_<job-create-time>`.
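If the prebuilt `ModelBatchPredictOp` component is used for this BigQuery-in / BigQuery-out step, the configuration would differ from the earlier CSV example only in the source and destination arguments; the resource names below are placeholders:

```python
from google_cloud_pipeline_components.v1.batch_predict_job import ModelBatchPredictOp

batch_predict_task = ModelBatchPredictOp(
    project="my-gcp-project",                    # placeholder project id
    location="europe-west2",                     # placeholder region
    job_display_name="xgboost-batch-predictions",
    model=champion_model.outputs["model"],       # assumed champion-model output
    instances_format="bigquery",
    bigquery_source_input_uri="bq://my-gcp-project.my_dataset.input_table",  # placeholder
    predictions_format="bigquery",
    # The job writes its results to a table named
    # prediction_<model-display-name>_<job-create-time> under this destination.
    bigquery_destination_output_uri="bq://my-gcp-project.my_dataset",        # placeholder
)
```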