# Eliminate GraphExecutionContext (#43)
@abrown, the session/execution context interface caches the parameterization of configuration and device selection. You are going to leave this state on the […]. The additional scenario for an execution context is passing this parameterization in across multiple models (chaining) or multiple modes: natively and in the browser, we pass the GPU context between DirectX and ONNX, or WebGPU and WebNN.
As discussed in WebAssembly#43, there is no requirement to set up tensors prior to calling `compute`, nor to retrieve them separately afterwards. As of WebAssembly#59, passing around tensors is cheap (they're resources now), so no data copy is necessary if we adopt this PR. This change proposes removing the `set-input` and `get-output` functions, moving all of the tensor-passing to `compute`. Closes WebAssembly#43.
This issue proposes a simplification to the wasi-nn API: eliminate the `GraphExecutionContext` state object altogether and instead simply pass all tensors to and from an inference call, `compute(list<tensor>) -> result<list<tensor>, ...>`. This change would make `set_input` and `get_output` unnecessary, and they would also be removed.
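As a concrete illustration, here is a minimal WIT sketch of that single-call shape; the interface name, the stand-in `tensor` record, the `string` error type, and placing `compute` on a `graph` resource are all assumptions for illustration, not the actual wasi-nn definitions:

```wit
interface inference {
    // Minimal stand-in for the real tensor type.
    record tensor {
        dimensions: list<u32>,
        data: list<u8>,
    }

    resource graph {
        // One call performs the entire inference: all inputs in, all
        // outputs out. No execution context, no set_input/get_output.
        compute: func(inputs: list<tensor>) -> result<list<tensor>, string>;
    }
}
```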
As background, the WITX IDL is the cause of `GraphExecutionContext`'s existence. As I understood it back when wasi-nn was originally designed, WITX forced us to pass an empty "pointer + length" buffer across to the host so that the host could fill it. This led to `get_output(...)`, which included an index parameter for retrieving tensors from multi-output models (multiple outputs are a possibility that must be handled, though not too common). Because `get_output` was now separate from `compute`, we needed some state to track the inference request: `GraphExecutionContext`.
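For contrast, the shape this issue proposes to remove looks roughly like the following paraphrase (not the literal spec text; `tensor` is as in the sketch above):

```wit
// Paraphrase of the current multi-call shape, not the actual definitions.
resource graph-execution-context {
    // Inputs are staged one at a time before running the inference...
    set-input: func(index: u32, t: tensor) -> result<_, string>;
    compute: func() -> result<_, string>;
    // ...and outputs are fetched afterwards, by index; the caller must
    // already know how large each output's buffer should be.
    get-output: func(index: u32) -> result<tensor, string>;
}
```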
Now, with WIT, we can expect the ABI to host-allocate into our WebAssembly linear memory for us. This is better in two ways:

- with `get_output(...)`, the user had to statically know how large that buffer should be; this led to thoughts about expanding the API with some introspection (feature: describe graph inputs and outputs, #37); if we replace `GraphExecutionContext` with a single `compute` call, this is no longer a user-facing paper cut
- if `compute` accepts and returns all the necessary tensors, then `GraphExecutionContext`, `set_input`, and `get_output` can all be removed, making the API easier to explain and use
One consideration here is ML framework compatibility: some frameworks (e.g., OpenVINO) expose an equivalent to `GraphExecutionContext` in their external API that must be called by implementations of wasi-nn. But, because this context object can be created inside the implementation, there is no compatibility issue. Implementations of `compute` will simply do a bit more than they currently do, but no more work overall.

Another consideration is memory copying overhead: will WIT force us to copy the tensor bytes across the guest-host boundary in both directions? Tensors can be large, and additional copies could be expensive. For output tensors, this may be unavoidable: when a tensor is generated on the host side during inference, it must somehow be made accessible to the Wasm guest, and copying is a simple solution. For input tensors, though, this discussion might suggest that there is no WIT-inherent limitation forcing the copy. If tensor copying becomes a bottleneck, perhaps WIT resources could be the solution.
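To make that last idea concrete, a resource-backed tensor might look something like the sketch below, in which only a handle crosses the guest-host boundary and the bytes are copied into guest memory only on request (the constructor and method names are hypothetical):

```wit
// Hypothetical resource-backed tensor; not the actual wasi-nn definitions.
resource tensor {
    // The host keeps the bytes; the guest holds only a handle.
    constructor(dimensions: list<u32>, data: list<u8>);
    dimensions: func() -> list<u32>;
    // Bytes are copied into guest linear memory only on explicit request.
    data: func() -> list<u8>;
}
```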