ModelMesh Serving serves different kinds of models via different Serving Runtime implementations. A Serving Runtime is one or more containers which:
- Can dynamically load and unload models from disk into memory on demand
- Expose a gRPC service endpoint to serve inferencing requests for loaded models
More specifically, the container(s) must:
1. Implement the simple model management gRPC SPI which comprises RPC methods to load/unload models, report their size, and report the runtime's total capacity
2. Implement one or more other arbitrary gRPC services to serve inferencing requests for already-loaded models
The gRPC services for (2) must all be served from the same server endpoint. The management service SPI may be served by that same endpoint or a different one. Each of these endpoints may listen on a localhost port or a unix domain socket. For best performance, a domain socket is preferred for the inferencing endpoint, and the corresponding file should be created in an empty dir within one of the containers. This dir will become a mount in all of the runtime containers when they are run.
Below is a description of how to implement the mmesh.ModelRuntime gRPC service, specified in model-runtime.proto. Note that this is currently subject to change, but we will try to ensure that any changes are backwards-compatible or at least will require minimal change on the runtime side.
So that ModelMesh Serving can decide when/where models should be loaded and unloaded, a given serving runtime implementation must communicate details of how much capacity it has to hold loaded models in memory, as well as how much each loaded model consumes.
Model sizes are communicated in a few different ways:
- A rough "global" default/average size for all models must be provided in the defaultModelSizeInBytes field in the response to the runtimeStatus rpc method. This should be a very conservative estimate.
- A predicted size can optionally be provided by implementing the predictModelSize rpc method. This will be called prior to loadModel and if implemented should return immediately (for example it should not make remote calls which could be delayed).
- The more precise size of an already-loaded model can be provided by either:
  - Including it in the sizeInBytes field of the response to the loadModel rpc method
  - Not setting it in the loadModel response, and instead implementing the separate modelSize method to return the size. This will be called immediately after loadModel returns, and isn't required to be implemented if the first option is used.

  The second of these last two options is preferred when a separate step is required to determine the size after the model has already been loaded. This is so that the model can start to be used for inferencing immediately, while the sizing operation is still in progress.
Capacity is indicated once via the capacityInBytes field in the response to the runtimeStatus rpc method and assumed to be constant.
Ideally, the value of capacityInBytes should be calculated dynamically as a function of your model server container's allocated memory. One way to arrange this is via Kubernetes' Downward API - mapping the container's requests.memory property to an environment variable. Of course some amount of fixed overhead should likely be subtracted from this value:
env:
  - name: MODEL_SERVER_MEM_REQ_BYTES
    valueFrom:
      resourceFieldRef:
        containerName: my-model-server
        resource: requests.memory
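As a rough illustration of the calculation described above, the following Python sketch reads the environment variable populated by the Downward API and subtracts a fixed overhead. The overhead value and the function name are assumptions made for this example only; choose an overhead that reflects your own server's baseline memory use.

```python
import os

# Hypothetical fixed overhead reserved for the server process itself
# (an assumption for illustration, not a recommended value).
FIXED_OVERHEAD_BYTES = 256 * 1024 * 1024


def capacity_in_bytes(environ=os.environ, overhead=FIXED_OVERHEAD_BYTES):
    """Derive a capacityInBytes value from the container's memory request."""
    raw = environ.get("MODEL_SERVER_MEM_REQ_BYTES")
    if raw is None:
        raise RuntimeError("MODEL_SERVER_MEM_REQ_BYTES is not set")
    requested = int(raw)
    if requested <= overhead:
        raise ValueError("memory request does not exceed the fixed overhead")
    # Whatever remains after the overhead is available for loaded models.
    return requested - overhead
```

The result would then be returned in the capacityInBytes field of the READY response.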
message RuntimeStatusRequest {}
This is polled at the point that the main model-mesh container starts to check that the runtime is ready. You should return a response with status set to STARTING until the runtime is ready to accept other requests and load/serve models, at which point status should be set to READY.
The other fields in the response only need to be set in the READY response (and will be ignored prior to that). Once READY is returned, no further calls will be made unless the model-mesh container unexpectedly restarts.
Currently, to ensure overall consistency of the system, it is required that runtimes purge any loaded/loading models when receiving a runtimeStatus call, and do not return READY until this is complete. Typically, it's only called during initialization prior to any load/unloadModel calls and hence this "purge" will be a no-op. But runtimes should also handle the case where there are models loaded. It is likely that this requirement will be removed in a future update, but ModelMesh Serving will remain compatible with runtimes that still include the logic.
message RuntimeStatusResponse {
  enum Status {
    STARTING = 0;
    READY = 1;
    FAILING = 2; //not used yet
  }
  Status status = 1;
  // memory capacity for static loaded models, in bytes
  uint64 capacityInBytes = 2;
  // maximum number of model loads that can be in-flight at the same time
  uint32 maxLoadingConcurrency = 3;
  // timeout for model loads in milliseconds
  uint32 modelLoadingTimeoutMs = 4;
  // conservative "default" model size,
  // such that "most" models are smaller than this
  uint64 defaultModelSizeInBytes = 5;
  // version string for this model server code
  string runtimeVersion = 6;
  message MethodInfo {
    // "path" of protobuf field numbers leading to the string
    // field within the request method corresponding to the
    // model name or id
    repeated uint32 idInjectionPath = 1;
  }
  // optional map of per-gRPC rpc method configuration
  // keys should be fully-qualified gRPC method name
  // (including package/service prefix)
  map<string,MethodInfo> methodInfos = 8;
  // EXPERIMENTAL - Set to true to enable the mode where
  // each loaded model reports a maximum inferencing
  // concurrency via the maxConcurrency field of
  // the LoadModelResponse message. Additional requests
  // are queued in the modelmesh framework. Turning this
  // on will also enable latency-based autoscaling for
  // the models, which attempts to minimize request
  // queueing time and requires no other configuration.
  bool limitModelConcurrency = 9;
}
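The runtimeStatus behaviour described above (report STARTING until initialized, then purge any models and report READY) might be sketched in Python as follows. This is a minimal sketch under stated assumptions: plain dictionaries stand in for the generated protobuf messages, and the class name, method names, and the concurrency, timeout, and version values are all hypothetical.

```python
import threading


class RuntimeState:
    """Tracks readiness and loaded models for a hypothetical runtime."""

    def __init__(self, capacity_bytes, default_model_size_bytes):
        self._lock = threading.Lock()
        self._initialized = False
        self._models = {}  # modelId -> loaded model object
        self._capacity = capacity_bytes
        self._default_size = default_model_size_bytes

    def mark_initialized(self):
        """Call once the server can accept load and inference requests."""
        with self._lock:
            self._initialized = True

    def runtime_status(self):
        """Build the fields of a RuntimeStatusResponse as a plain dict."""
        with self._lock:
            if not self._initialized:
                # Other fields are ignored until READY is reported.
                return {"status": "STARTING"}
            # Purge anything still loaded; a no-op on the usual first call.
            self._models.clear()
            return {
                "status": "READY",
                "capacityInBytes": self._capacity,
                "maxLoadingConcurrency": 2,      # illustrative value
                "modelLoadingTimeoutMs": 90000,  # illustrative value
                "defaultModelSizeInBytes": self._default_size,
                "runtimeVersion": "0.0.1-example",
            }
```

In a real server, the purge step would also need to cancel in-flight loads and release the resources held by each model before READY is returned.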
message LoadModelRequest {
  string modelId = 1;
  string modelType = 2;
  string modelPath = 3;
  string modelKey = 4;
}
The runtime should load a model with name/id specified by the modelId field into memory ready for serving, from the path specified by the modelPath field. At this time, the modelType field value should be ignored.
The modelKey field will contain a JSON string with the following contents:
{
  "model_type": {
    "name": "mytype",
    "version": "2"
  }
}
Where model_type is the modelType section from the originating Predictor custom resource. Note that version is optional and may not be present. In future, additional attributes might be present in the outer JSON object so your implementation should ignore them gracefully.
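One way to read the modelKey contents while tolerating a missing version and any unknown outer attributes is sketched below in Python. The function name is hypothetical; only the JSON field names come from the description above.

```python
import json


def parse_model_type(model_key):
    """Return (name, version) from a modelKey JSON string.

    version is None when absent; unknown attributes are simply ignored.
    """
    data = json.loads(model_key)
    model_type = data.get("model_type") or {}
    return model_type.get("name"), model_type.get("version")
```

Because only the keys of interest are looked up, attributes added to the outer object in future do not need any special handling.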
The response shouldn't be returned until the model has loaded successfully and is ready to use.
message LoadModelResponse {
  // OPTIONAL - If nontrivial cost is involved in
  // determining the size, return 0 here and
  // do the sizing in the modelSize function
  uint64 sizeInBytes = 1;
  // EXPERIMENTAL - Applies only if limitModelConcurrency = true
  // was returned from runtimeStatus rpc.
  // See RuntimeStatusResponse.limitModelConcurrency for more detail
  uint32 maxConcurrency = 2;
}
message UnloadModelRequest {
  string modelId = 1;
}
The runtime should unload the previously loaded (or failed) model specified by modelId, and not return a response until the unload is complete and corresponding resources have been freed. If the specified model is not found/loaded, the runtime should return immediately (without error).
message UnloadModelResponse {}
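The unload semantics above (block until resources are freed, and treat an unknown model as a successful no-op) could look roughly like this in Python. The ModelStore class and the assumption that a loaded model exposes an optional close() method are both hypothetical and exist only for this sketch.

```python
class ModelStore:
    """In-memory registry of loaded models for a hypothetical runtime."""

    def __init__(self):
        self._models = {}  # modelId -> loaded model object

    def load(self, model_id, model):
        self._models[model_id] = model

    def unload(self, model_id):
        """Remove a model and free its resources before returning."""
        model = self._models.pop(model_id, None)
        if model is None:
            return  # not found/loaded: return immediately, without error
        close = getattr(model, "close", None)
        if callable(close):
            close()  # release resources synchronously

    def is_loaded(self, model_id):
        return model_id in self._models
```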
The model runtime server can expose any number of protobuf-based gRPC services on the grpcDataEndpoint to use for inferencing requests. ModelMesh Serving is agnostic to specific service definitions (request/response message content), but for tensor-in/tensor-out based services it is recommended to conform to the KFServing V2 dataplane API spec.
A given model runtime server will be guaranteed to only receive model inferencing requests for models that had previously completed loading successfully (via a loadModel request), and to have not been unloaded since.
Though generally agnostic to the specific API methods, ModelMesh Serving does need to be able to set/override the model name/id used in a given request. There are two options for obtaining the model name/id within the request (which will correspond to the same id previously passed to loadModel):
1. Obtain it from one of the mm-model-id or mm-model-id-bin gRPC metadata headers (the latter is required for non-ASCII UTF-8 ids). Precisely how this is done depends on the implementation language - see gRPC documentation for more information (TODO specific refs/examples here).
2. Provide the location of a specific string field within your request protobuf message (per RPC method) which will be replaced with the target model id. This is done via the methodInfos map in the runtime's response to the runtimeStatus RPC method. Each applicable inferencing method should have an entry whose idInjectionPath field is set to a list of field numbers corresponding to the hierarchy of nested messages within the request message, the last of which being the number of the string field to replace. For example, if the id is a string field in the top-level request message with number 1 (as is the case in the KFServing V2 ModelInferRequest), this list would be set to just [1].
Option 2 is particularly applicable when integrating with an existing gRPC-based model server.
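For option 1, a Python sketch of picking the model id out of request metadata is shown below. It assumes the metadata is available as an iterable of (key, value) pairs, as gRPC Python provides through context.invocation_metadata(), and that values of "-bin" keys arrive as bytes. The function name is hypothetical, and preferring the binary header when both are present is an assumption of this sketch.

```python
def model_id_from_metadata(metadata):
    """Extract the model id from (key, value) gRPC metadata pairs."""
    values = dict(metadata)
    raw = values.get("mm-model-id-bin")
    if raw is not None:
        # Binary header carries the id as UTF-8 bytes (needed for non-ASCII ids).
        return raw.decode("utf-8") if isinstance(raw, bytes) else raw
    # Fall back to the plain ASCII header; None if neither header is set.
    return values.get("mm-model-id")
```

Inside a servicer method this might be called as model_id_from_metadata(context.invocation_metadata()).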
Each Serving Runtime implementation is defined using the custom resource type ServingRuntime which defines information about the runtime such as which container images need to be loaded, and the local gRPC endpoints on which they will listen. When the resource is applied to the Kubernetes cluster, the model server will deploy the runtime specific containers which will then enable support for the corresponding model types.
The following is an example of a ServingRuntime custom resource:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime
spec:
  supportedModelTypes:
    - name: new-modeltype
      version: "1"
  containers:
    - name: model-server
      image: samplemodelserver:latest
  grpcEndpoint: "port:8085"
  grpcDataEndpoint: "port:8090"
The ability to specify multiple containers provides a nice way to integrate with existing model servers via an adapter pattern, as long as they provide the required capability of dynamically loading and unloading models.
Note: In the above diagram, only the adapter and model server containers are explicitly specified in the ServingRuntime CR, the others are included automatically.
The built-in runtimes based on NVIDIA's Triton Inference Server and the Seldon MLServer, and their corresponding adapters, serve as good examples of this and can be used as a reference.
Available attributes in the ServingRuntime spec:
| Attribute | Description |
|---|---|
| disabled | Disables this runtime |
| containers | List of containers associated with the runtime |
| containers[ ].image | The container image for the current container |
| containers[ ].command | Executable command found in the provided image |
| containers[ ].args | List of command line arguments as strings |
| containers[ ].resources | Kubernetes limits or requests |
| containers[ ].imagePullPolicy | The container image pull policy |
| containers[ ].workingDir | The working directory for current container |
| grpcEndpoint | The port for model management requests |
| grpcDataEndpoint | The port or unix socket for inferencing requests arriving to the model server over the gRPC protocol. May be set to the same value as grpcEndpoint |
| supportedModelTypes | List of model types supported by the current runtime |
| supportedModelTypes[ ].name | Name of the model type |
| supportedModelTypes[ ].version | Version of the model type. It is recommended to include only the major version here, for example "1" rather than "1.15.4" |
| storageHelper.disabled | Disables the storage helper |
| nodeSelector | Influence Kubernetes scheduling to assign pods to nodes |
| affinity | Influence Kubernetes scheduling to assign pods to nodes |
| tolerations | Allow pods to be scheduled onto nodes with matching taints |
| replicas | The number of replicas of the runtime to create. This overrides the podsPerRuntime configuration |
Several of the attributes (grpcEndpoint, grpcDataEndpoint) support either Unix Domain Sockets or TCP. The endpoint should be formatted as either port:<number> or unix:<path>. The provided container must be either listening on the specific TCP socket or at the provided path.
Note: If a unix domain socket is specified for both grpcEndpoint and grpcDataEndpoint then it must either be the same socket (identical path) or reside in the same directory.
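To make the endpoint format and the shared-directory rule concrete, here is a small Python sketch that parses port:<number> and unix:<path> values and checks whether two socket paths satisfy the note above. Both function names are hypothetical, and this is not how ModelMesh Serving itself validates the fields.

```python
import posixpath


def parse_endpoint(endpoint):
    """Parse 'port:<number>' or 'unix:<path>' into a (kind, value) tuple."""
    scheme, sep, rest = endpoint.partition(":")
    if not sep or not rest:
        raise ValueError(f"malformed endpoint: {endpoint!r}")
    if scheme == "port":
        port = int(rest)
        if not 0 < port < 65536:
            raise ValueError(f"port out of range: {port}")
        return ("tcp", port)
    if scheme == "unix":
        return ("unix", rest)
    raise ValueError(f"unsupported endpoint scheme: {scheme!r}")


def sockets_compatible(path_a, path_b):
    """True if two socket paths are identical or share a directory."""
    return path_a == path_b or posixpath.dirname(path_a) == posixpath.dirname(path_b)
```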
The following example demonstrates all of the possible attributes that can be set in the model serving runtime spec:
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime
spec:
  supportedModelTypes:
    - name: my_model_type # name of the model
      version: "1"
  containers:
    - args:
        - arg1
        - arg2
      command:
        - command
        - command2
      env:
        - name: name
          value: value
        - name: fromSecret
          valueFrom:
            secretKeyRef:
              key: mykey
      image: image
      name: name
      resources:
        limits:
          memory: 200Mi
      imagePullPolicy: IfNotPresent
      workingDir: "/container/working/dir"
  disabled: false
  gpu: false
  storageHelper:
    disabled: true
  grpcEndpoint: port:1234 # or unix:/path
  grpcDataEndpoint: port:1234 # or unix:/path
  httpDataEndpoint: port:1234 # Not yet supported
  # To influence pod scheduling, one or more of the following can be used
  nodeSelector: # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
    kubernetes.io/arch: "amd64"
  affinity: # https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "kubernetes.io/arch"
                operator: In
                values:
                  - "amd64"
  tolerations: # https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
    - key: "example-key"
      operator: "Exists"
      effect: "NoSchedule"
Storage helper will download the model from the S3 bucket using the secret storage-config and place it in the local path. By default, storage helper is enabled in the serving runtime. Storage helper can be disabled by adding the configuration storageHelper.disabled set to true in the serving runtime. If the storage helper is disabled, the custom runtime needs to handle access to and pulling model data from storage itself. Configuration can be passed to the runtime's pods through environment variables.
Consider the custom runtime defined above with the following predictor:
apiVersion: serving.kserve.io/v1alpha1
kind: Predictor
metadata:
  name: my-mnist-predictor
spec:
  modelType:
    name: my_model_type
    version: "1"
  path: my_models/mnist-model
  storage:
    s3:
      secretKey: my_storage
      bucket: my_bucket
If the storage helper is enabled, the model serving container will receive the below model metadata in the loadModel call where modelPath will contain the path of the model in the local file system.
{
  "modelId": "my-mnist-predictor-<suffix>",
  "modelType": "my_model_type",
  "modelPath": "/models/my-mnist-predictor-<suffix>/",
  "modelKey": "<serialized metadata as JSON, see below>"
}
The following metadata for the predictor is serialized to a string and embedded as the modelKey field:
{
  "bucket": "my_bucket",
  "disk_size_bytes": 2415,
  "model_type": {
    "name": "my_model_type",
    "version": "1"
  },
  "storage_key": "my_storage"
}
If the storage helper is disabled, the model serving container will receive the below model metadata in the loadModel call where modelPath is the same as the path provided in the predictor CR.
{
  "modelId": "my-mnist-predictor-<suffix>",
  "modelType": "my_model_type",
  "modelPath": "my_models/mnist-model",
  "modelKey": "<serialized metadata as JSON, see below>"
}
The following metadata for the predictor is serialized to a string and embedded as the modelKey field:
{
  "bucket": "my_bucket",
  "model_type": {
    "name": "my_model_type",
    "version": "1"
  },
  "storage_key": "my_storage"
}
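When the storage helper is disabled, a runtime has to work out for itself where the model lives. The Python sketch below gathers the relevant pieces from a loadModel request shaped like the examples above. The request is represented as a plain dictionary and the function name is hypothetical; actually fetching the data with the referenced storage configuration is left out.

```python
import json


def describe_load_request(request):
    """Collect the storage details a runtime needs to fetch a model itself."""
    key = json.loads(request["modelKey"])
    return {
        "model_id": request["modelId"],
        # With the storage helper disabled this is the path from the Predictor.
        "path": request["modelPath"],
        "bucket": key.get("bucket"),
        "storage_key": key.get("storage_key"),
        "type_name": (key.get("model_type") or {}).get("name"),
    }
```

For the Predictor shown earlier, this would yield the bucket my_bucket, the storage key my_storage, and the path my_models/mnist-model.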