description: learn how to efficiently deploy to GPUs using Triton Inference Server
The notebooks in this directory show how to take advantage of the interoperability between Azure Machine Learning and NVIDIA Triton Inference Server for cost-effective real-time inference on GPUs.
Open either of the sample notebooks in this directory to run Triton in Python.
You must have the latest version of the Azure Machine Learning CLI installed to run these commands. Follow the instructions here to download or upgrade the CLI.
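For example, on a machine with the Azure CLI already installed, the machine learning extension can be added or updated as follows (a minimal sketch, assuming these commands target the v1 `azure-cli-ml` extension):

```bash
# Install the Azure Machine Learning CLI extension (v1), or update it if present
az extension add -n azure-cli-ml
az extension update -n azure-cli-ml

# Confirm the ml commands are available
az ml -h
```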
Prepare the model files, register the model, and deploy it to the GPU-enabled AKS cluster:

```bash
# Prepare the sample Triton model repository
python ../../code/deployment/triton/model_utils.py

# Register the Triton model folder with Azure Machine Learning
az ml model register -p ../../models/triton -n bidaf-model --model-framework=Multi

# Deploy the registered model as a web service on the AKS GPU compute target
az ml model deploy -n triton-webservice -m bidaf-model:1 --dc deploymentconfig.json --compute-target aks-gpu-deploy
```
Once the service is deployed, retrieve the scoring URI and authentication keys, then check that the server is ready:
```bash
# Get the scoring URI
az ml service show --name triton-webservice

# Get the authentication keys
az ml service get-keys --name triton-webservice

# Check that Triton is ready to serve requests (v2 health endpoint)
curl -H "Authorization: Bearer <primaryKey>" -v <scoring-uri>/v2/health/ready
```
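To query the model metadata endpoint, the v2 protocol exposes `GET v2/models/<model-name>`. A sketch, assuming the model is served under the registered name `bidaf-model`:

```bash
# Fetch model metadata (platform, input and output tensors) via the v2 API
curl -H "Authorization: Bearer <primaryKey>" <scoring-uri>/v2/models/bidaf-model
```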
Read more about the KFServing predict API here.
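Predictions go through `POST v2/models/<model-name>/infer` in the same protocol. Below is a generic sketch of the request shape; the tensor name, datatype, shape, and data are placeholders, not the actual bidaf-model signature (take the real values from the model metadata response above):

```bash
# Generic v2 inference request; INPUT0 and its shape/datatype/data are
# placeholders to be replaced with values from the model metadata.
curl -X POST \
  -H "Authorization: Bearer <primaryKey>" \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "INPUT0", "shape": [1, 4], "datatype": "FP32", "data": [1.0, 2.0, 3.0, 4.0]}]}' \
  <scoring-uri>/v2/models/bidaf-model/infer
```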