v0.2.0
What's changed?
- Documentation
  - Updated readme and documentation
- API & SDK
  - Full OpenAI API compatibility (Aviary can now be queried with the `openai` Python package; see the example after this list)
    - `/v1/completions`
      - Parameters not yet supported (will be ignored): `suffix`, `n`, `logprobs`, `echo`, `best_of`, `logit_bias`, `user`
      - Additional parameters not present in the OpenAI API: `top_k`, `typical_p`, `watermark`, `seed`
    - `/v1/chat/completions`
      - Parameters not yet supported (will be ignored): `n`, `logprobs`, `echo`, `logit_bias`, `user`
      - Additional parameters not present in the OpenAI API: `top_k`, `typical_p`, `watermark`, `seed`
    - `/v1/models`
    - `/v1/models/<MODEL>`
  - Added `frequency_penalty` and `presence_penalty` parameters
  - `aviary run` is now blocking by default and will clarify that rerunning `aviary run` will remove existing models
  - Streamlined model configuration YAMLs
  - Added model configuration YAMLs for llama-2
  - Frontend Gradio app will now be started on the `/frontend` route to avoid conflicts with the backend
  - `openai` package is now a dependency for Aviary
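The OpenAI-compatible endpoints can be exercised directly with the `openai` client. The snippet below is a minimal sketch, assuming a local Aviary deployment at `http://localhost:8000` and the pre-1.0 `openai` client interface; the model id is hypothetical and should be replaced with one returned by `/v1/models`.

```python
# Minimal sketch: querying an Aviary deployment through its OpenAI-compatible API.
# The base URL and model id are assumptions -- adjust them for your deployment.
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed local Aviary endpoint
openai.api_key = "not-used"                   # Aviary does not check for an OpenAI key

# /v1/chat/completions -- OpenAI parameters such as frequency_penalty and
# presence_penalty are honored; unsupported ones (e.g. logit_bias) are ignored.
chat = openai.ChatCompletion.create(
    model="meta-llama/Llama-2-7b-chat-hf",  # hypothetical model id
    messages=[{"role": "user", "content": "What is Aviary?"}],
    frequency_penalty=0.5,
    presence_penalty=0.5,
    top_k=40,  # Aviary-specific extension, not part of the OpenAI API
)
print(chat.choices[0].message.content)

# /v1/models -- list the models served by this deployment.
print([m.id for m in openai.Model.list().data])
```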
- Backend
  - Refactor of multiple internal APIs
    - Renamed `Predictor` to `Engine`. `Engine` combines the functionality of initializers, predictors and pipelines.
    - Removed `Predictor` and `Pipeline`
    - Removed shallow classes and simplified abstractions
    - Removed dead code
    - Broke up large files & improved file structure
  - Removal of static batching
  - Added OpenAI-style `frequency_penalty` and `presence_penalty` parameters
  - Fixed generated special tokens not being returned correctly
  - Standardization of modelling code on an Apache 2.0 fork of text-generation-inference
  - Improved performance and stability
  - Added automatic warmup for supported models, ensuring that memory is used efficiently
  - Made scheduler and scheduler policy less prone to errors
  - Made sure that the `HUGGING_FACE_HUB_TOKEN` env var is propagated throughout all Aviary Backend processes to allow access to gated models such as llama-2
  - Added unit testing for core Aviary components
  - Added validations for user supplied parameters
  - Improved error handling and reporting
    - Error responses will now have correct status codes
  - Added basic observability for tokens & requests through Ray Metrics (piped through to Prometheus/Grafana)
This update introduces breaking changes to model configuration YAMLs and the Aviary SDK. Refer to the migration guide below for more details.
In order to use the Aviary backend, ensure you are using the official Docker image `anyscale/aviary:latest`. Using the backend without Docker is not a supported use case. The `anyscale/aviary:latest-tgi` image has been superseded by `anyscale/aviary:latest`.
Migration Guide For Model YAMLs
In the most recent version of Aviary we introduce breaking changes in the model YAMLs. This guide will help you migrate your existing model YAMLs to the new format.
Changes
- Move any fields under `model_config.initialization` to be under `model_config` and then remove `model_config.initialization`.
  Then remove the following sections/fields and everything that is under them:
  - `model_config.initializer`
  - `model_config.pipeline`
  - `model_config.batching`
- Rename `model_config` to `engine_config`.
  In v0.2, we introduce `Engine`, the Aviary abstraction for interacting with a model. In short, `Engine` combines the functionality of `initializers`, `pipelines`, and `predictors`. Pipeline and initializer parameters are no longer configurable. In v0.2 we remove the option to specify static batching and instead do continuous batching by default for performance improvement.
- Add the `Scheduler` and `Policy` configs.
  The scheduler is a component of the engine that determines which requests to run inference on. The policy is a component of the scheduler that determines the scheduling strategy. These components previously existed in Aviary; however, they weren't explicitly configurable.
  Previously, the following parameters were specified under `model_config.generation`:
  - `max_batch_total_tokens`
  - `max_total_tokens`
  - `max_waiting_tokens`
  - `max_input_length`
  - `max_batch_prefill_tokens`

  Rename `max_waiting_tokens` to `max_iterations_curr_batch` and place these parameters under `engine_config.scheduler.policy`, for example:
  ```yaml
  engine_config:
    scheduler:
      policy:
        max_iterations_curr_batch: 100
        max_batch_total_tokens: 100000
        max_total_tokens: 100000
        max_input_length: 100
        max_batch_prefill_tokens: 100000
  ```
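Because these edits are mechanical, they can also be scripted. The following is a hypothetical sketch using PyYAML that applies the moves and renames described above to an existing model YAML; the script name and invocation are illustrative only, and the output should still be reviewed by hand against the example above.

```python
# migrate_yaml.py -- hypothetical v0.1 -> v0.2 model YAML migration sketch.
# Assumes PyYAML is installed; field names follow the migration guide above.
import sys

import yaml

SCHEDULER_KEYS = [
    "max_batch_total_tokens",
    "max_total_tokens",
    "max_waiting_tokens",
    "max_input_length",
    "max_batch_prefill_tokens",
]


def migrate(config: dict) -> dict:
    model_config = config.pop("model_config")

    # Hoist fields from model_config.initialization, then drop the removed sections.
    model_config.update(model_config.pop("initialization", {}))
    for removed in ("initializer", "pipeline", "batching"):
        model_config.pop(removed, None)

    # Move scheduler parameters out of `generation`, renaming max_waiting_tokens.
    generation = model_config.get("generation", {})
    policy = {k: generation.pop(k) for k in SCHEDULER_KEYS if k in generation}
    if "max_waiting_tokens" in policy:
        policy["max_iterations_curr_batch"] = policy.pop("max_waiting_tokens")
    if policy:
        model_config.setdefault("scheduler", {})["policy"] = policy

    # Rename model_config to engine_config.
    config["engine_config"] = model_config
    return config


if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        old = yaml.safe_load(f)
    print(yaml.safe_dump(migrate(old), sort_keys=False))
```

Usage (illustrative): `python migrate_yaml.py old_model.yaml > new_model.yaml`.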