Features
- Add support for optional payload encryption in the client SDK and CLI as a follow-up to #484 (#493)
- Allow unicode characters in project (user group) name and domain name. (#1663)
- Improve exception logging stability by pre-formatting exception objects instead of pickling/unpickling them (#1759)
- Add new API to create new image from live session (#1973)
- Clear `error_logs` records in the `clear-history` command (#1989)
- Introduce `mgr schema dump-history` and `mgr schema apply-missing-revisions` command to ease the major upgrade involving deviation of database migration histories (#2002)
- Update `image forget` CLI command to untag image from registry before forgetting it from the database (#2010)
- Update `etcd-client-py` to 0.3.0 (#2014)
- Allow self-ssh in single-node single-container compute sessions. (#2032)
- Prevent deleting mounted folders. (#2036)
- Allow agent to report its internal registry snapshot via UNIX domain socket server (#2038)
- New redis client (experimental) (#2041)
- Expose user info to environment variables (#2043)
- Introduce the `rolling_count` GraphQL field to provide the current rate limit counter for a keypair within the designated time window slice (#2050)
- Deprecate the reliance on HTTP cookies for authenticating the pipeline service, switching to the use of HTTP headers instead (#2051)
- Allow user to explicitly set filename of model definition YAML (#2063)
- Add the `backend.ai plugin scan` command to inspect the plugin scan results from various entrypoint sources (#2070)
- Bring back etcetra-backed Etcd as an option for distributed lock backend (#2079)
- Enable distribute-lock configuration (#2080)
- Cache volume objects in `RootContext.get_volume` (#2081)
- Revamp images GQL query by changing image filtering from flag-based to feature set-based and add `aliases` field to customized image GQL schema (#2136)
- Added missing fields for `keypair_resource_policy` in client-py, models, etc. (#2146)
- Add parameters to `check-presets` SDK function (#2153)
- Add relay-aware `VirtualFolderNode` GQL Query (#2165)
- Also perform basic model service validation process when updating model service via `ModifyEndpoint` (#2167)
- Add support for mounting arbitrary VFolders on model service session (#2168)
- Add support for CentOS 8 based kernels (#2220)
- Clear zombie routes automatically (#2229)
- Add `scaling_group.agent_count_by_status` and `scaling_group.agent_total_resource_slots_by_status` GQL fields to query the count and the resource allocation of agents that belong to a scaling group. (#2254)
- Allow modifying model service session's environment variable setup (#2255)
- Add `endpoint.runtime_variant` column (#2256)
- Add new API to show list of supported inference runtimes (#2258)
- Add support for model service provisioning without `model-definition.yaml` (#2260)
- Allow superadmins to force-update session status through destroy API. (#2275)
- Add session status check & update API. (#2312)
- Add support for fetching container logs of a specific kernel. (#2364)
- Introduce Python native WSProxy (#2372)
- Implement scanning plugin entrypoints of external packages (#2377)
- Add `row_id`, `type` and `container_registry` fields to the `GroupNode` GQL schema. (#2409)
- Add support for PureStorage RapidFiles Toolkit v2 (#2419)
- Add API that extends lifespan of webserver's login session. (#2456)
- Allow bulk association and disassociation of scaling groups with domains, user groups, and key pairs. (#2473)
- Match container's timezone to container host OS when available (#2503)
- Add a pre-setup configuration menu to the TUI installer to allow setting the public-facing address of Backend.AI components (#2541)
- Now Backend.AI can run arbitrary container images without Backend.AI-specific metadata labels by introducing good default values and replacing intrinsic kernel-runner binaries with statically built ones (#2582)
- Allow `Bearer` as valid token type on model service authentication (#2583)
- Introduce automatic creation of a 'model-store' group upon inserting a new domain. (#2611)
- Add support for declaring custom description field for GraphQL `relay` edge types. (#2643)
- Add an `enable_LLM_playground` option to show/hide the LLM playground tab on the serving page. (#2677)
- Add `max_gaudi2_devices_per_container` config on webserver (#2685)
- Add `max_atom_plus_device_per_container` config on webserver (#2686)
- Introduce Account-manager component. (#2688)
- Add query depth limit config of GQL, add page size limit config of GQL Connection, and set default page size of GQL Connection to 10. (#2709)
- Add compute session GQL Relay query schema. (#2711)
- Allow `DataLoaderManager` to get a loader function by function itself rather than function name. (#2717)
- Allow filter and order in endpoint list GQL request. (#2723)
- Add new vfolder API to update sharing status. (#2740)
- Avoid raising a type error even if a particular table in the toml file is empty, as long as the default value for all settings exists. (#2782)
- Add an explicit configuration `scaling-group-type` to `agent.toml` so that the agent can distinguish whether it belongs to an SFTP resource group or not (#2796)
- Add per-session priority attributes and `ModifyComputeSession` GraphQL mutation to update session names and priorities (#2840)
- Add dependee/dependent/graph ComputeSessionNode connection queries (#2844)
- Implement the priority-aware scheduler that applies to any arbitrary scheduler plugin (#2848)
- Add support for setting a timeout when pulling Docker images and upgrade aiodocker to version 0.23.0. (#2852)
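
The exception-logging entry above (#1759) replaces pickling of exception objects with pre-formatted text. A minimal Python sketch of that idea, using only the standard library, is shown here. The helper name `preformat_exception` and the layout of the returned dictionary are illustrative assumptions, not the actual Backend.AI implementation.

```python
import traceback


def preformat_exception(exc: BaseException) -> dict[str, str]:
    """Convert an exception into plain strings that are always safe to serialize.

    Hypothetical helper for illustration only.
    """
    return {
        "type": type(exc).__name__,
        "message": str(exc),
        # format_exception() returns a list of text lines; join them into one string.
        "traceback": "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        ),
    }


try:
    int("not-a-number")
except ValueError as e:
    record = preformat_exception(e)
```

Because the result holds only strings, it can cross process boundaries or be written to a log store without the receiver having to import the original exception class.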
Improvements
- Enable robust DB connection handling by allowing `pool-pre-ping` setting. (#1991)
- Enhance update mechanism of session & kernel status. (#2311)
- Remove database-level foreign key constraints in `vfolders.{user,group}` columns to decouple the timing of vfolder deletion and user/group deletion. (#2404)
- Implement storage-host RBAC interface. (#2505)
- Optimize the query latency when fetching a large number of agents with stat metrics from Redis (#2558)
- Split out `ai.backend.logging` package from the `ai.backend.common` to improve reusability and reduce the startup time (i.e., import latencies) (#2760)
- Avoid using `collections.OrderedDict` when not necessary in the manager API and client SDK (#2842)
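
As background for the `collections.OrderedDict` entry (#2842): since Python 3.7 the built-in `dict` preserves insertion order, so `OrderedDict` is only needed for its extra behavior, such as `move_to_end()` or order-sensitive equality. A small sketch with made-up field names:

```python
import json
from collections import OrderedDict

# Field names here are made up for illustration.
plain = {"id": 1, "name": "example", "status": "RUNNING"}
ordered = OrderedDict([("id", 1), ("name", "example"), ("status", "RUNNING")])

# Both keep insertion order, so they serialize identically.
same_json = json.dumps(plain) == json.dumps(ordered)

# The remaining difference: OrderedDict equality is order-sensitive, dict equality is not.
order_insensitive = {"a": 1, "b": 2} == {"b": 2, "a": 1}
order_sensitive = OrderedDict(a=1, b=2) == OrderedDict(b=2, a=1)
```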
Deprecations
- Remove no longer used `env-tester-{admin,user,user2}.sh` scripts and all references (#1956)
Fixes
- Merge `kernels.role` into `sessions.session_type` and check the image compatibility based on comparison with the `ai.backend.role` label (#1587)
- Refactor `PendingSession` Scheduler into `PendingSession` scheduler and `AgentSelector`, and replace `roundrobin` flag with `AgentSelectionStrategy.RoundRobin` policy. (#1655)
- Do not omit to update session's occupying resources to DB when a kernel starts. (#1832)
- Fix DDN command output handling when exceeding quotas. (#1901)
- Explicitly specify the storage-side UID/GID when creating qtrees in the NetApp storage backend (#1983)
- Sync mismatch between `kernels.session_name` and `sessions.name` and fix session-rename API to update `session_name` of sibling kernels atomically. (#1985)
- Change function default arguments from mutable object to `None`. (#1986)
- Revert some VFolder APIs response type to remove mismatch between `Content-Type` header and body. (#1988)
- Upgrade pants to 2.21.0.dev4 for Python 3.12 support in their embedded pex/pip versions (#1998)
- Fix Graylog log adapter not working after upgrading to Python 3.12 (#1999)
- Fix `compute_container` GraphQL query resolver functions. (#2012)
- Fix harbor v2 image scanner skipping importing rest of the artifacts when any of the items does not include tag (#2015)
- Let external log viewers display more accurate, meaningful stack frames of logger invocations. (#2019)
- Fix handling of undefined values in the ModifyImage GraphQL mutation. (#2028)
- Fix container commit not working on certain docker engine versions (#2040)
- Add omitted request fetching from client to manager about deleting vfolder in trash bin. (#2042)
- Fix a buggy restriction on VFolder deletion due to wrong query condition (#2055)
- Fix wrong usage of dataloader in GQL group resolver. (#2056)
- Ensure that vfolders, including automount vfolders, are mounted during session creation only if their status is not set to "DEAD" (i.e., deleted). (#2059)
- Fix wrong calculation of resource usage (#2062)
- Fix VFolder file operation not working when user has been granted access to shared but deleted VFolder which has same name with the normal one (#2072)
- Add missing type argument in group query (#2073)
- Let the `backend.ai mgr clear-history` command clear session records as well as kernel records (#2077)
- Fix `compute_session_list` GQL query not responding on an abundant amount of sessions (#2084)
- Fix VFolder invitation not accepted when inviting VFolder shares name with already deleted one (#2093)
- Fix orphan model service routes being created (#2096)
- Fix initialization of the resource usage API's kernel-level usage aggregation (#2102)
- Fix model server starting on every kernel (including sub role kernels) on multi container inference session (#2124)
- Add missing `commit_session_to_file` to `OP_EXC` (#2127)
- Fix wrong SQL query build for GQL Relay node (#2128)
- Pass ImageRef.canonical in `commit_session_to_file` (#2134)
- Handle fileset-already-exists response of `create-filset` API request and make sure to wait between all GPFS job polling iterations (#2144)
- Skip any possible redundant quota update requests when creating new quota (#2145)
- Fix error when calling `check_presets` Client SDK API with an invalid `group` parameter, and rewrite Client SDK to access all APIConfig fields (#2152)
- Ensure that all pending sessions are picked by schedulers (#2155)
- Fix user creation error when any model-store does not exist. (#2160)
- Fix buggy resolver of `model_card` GQL Query. (#2161)
- Fix security vulnerability for `sudo_session_enabled` (#2162)
- Rename `endpoints.model_mount_destiation` to `model_mount_destination` (#2163)
- Wait for real quota scope directory creation after Netapp `create_qtree()` call (#2170)
- Fix wrong per-user concurrency calculation logic (#2175)
- Keep `sync_container_lifecycles()` bgtask alive in a loop. (#2178)
- Fix missing check for group (project) vfolder count limit and error handling with an invalid `group` parameter (#2190)
- Fix model service persisting on `degraded` status forever in rare chance when trying to delete the service (#2191)
- Fix error when querying or mutating GraphQL using `BigInt` field type (#2203)
- Ensure that utilization idleness is checked after a set period. (#2205)
- Fix `backend.ai ssh` command execution when packaged as SCIE/PEX (#2226)
- Fix `endpoints` query not working when trying to load `image_row.aliases`, and fix `endpoints.status` reporting `PROVISIONING` when its status is in `DESTROYING` state (#2233)
- Fix GQL raising error when trying to resolve `endpoints.errors` field occasionally (#2236)
- Fix `ZeroDivisionError` in volume usage calculation by returning 0% when volume capacity is zero (#2245)
- Fix GraphQL to support query to non-installed images (#2250)
- Add missing `push_image` method implementation to Dummy Agent (#2253)
- Rename no-op `access_key` parameter of `endpoint_list` GQL Query to `user_uuid` (#2287)
- Fix `ai.backend.service-ports` label syntax broken when image does not expose built-in service port (#2288)
- Improve stability of `untag_image_from_registry` mutation (#2289)
- Fix SSH not working between kernels started with customized image (#2290)
- Fix invalid container memory capacity being reported (#2291)
- Corrected an issue where the `resource_policy` field in the user model was incorrectly mapped to `domain_name`. (#2314)
- Skip cleaning containerless kernels that are still creating their containers. (#2317)
- Fix model service sessions created before 24.03.5 failing to spawn (#2318)
- Fix image commit not working (#2319)
- Fix model service session scheduler (`scale_services()`) failing when sessions bound to active route already marked as terminated (#2320)
- Fix container metric collection halted on systems with Cgroups v1 (#2321)
- Run batch execution after the batch session starts. (#2327)
- Add support for configuring `sync_container_lifecycles()` task. (#2338)
- Fix mismatches between responses of `/services/_runtimes` and new model service creation input (#2371)
- Fix incorrect check of values returned from docker stat API. (#2389)
- Shut down the agent properly by removing code that waits on a cancelled task. (#2392)
- Restrict GraphQL query to `user_nodes` field to require `superadmin` privilege (#2401)
- Handle all possible exceptions when scheduling single node session so that the status information of pending session is not empty. (#2411)
- Utilize `ExtendedJSONEncoder` for error logging to handle `UUID` objects in `extra_data` (#2415)
- Change outdated references in event module from `kernels` to `sessions`. (#2421)
- Upgrade `inquirer` to remove dependency on deprecated `distutils`, which breaks execution of the scie builds (#2424)
- Allow specific status of vfolders to query to purge. (#2429)
- Update the install-dev scripts to use `pnpm` instead of `npm` to speed up installation and resolve some peculiar version resolution issues related to esbuild. (#2436)
- Fix a packaging issue in the `backendai-webserver` scie executable due to missing explicit requirement of setuptools (#2454)
- Improve pruning of non-physical filesystems when measuring disk usage in agents (#2460)
- Update the install-dev scripts to install `pnpm` if pnpm isn't installed. (#2472)
- Improve error handling of initialization failures in the kernel runner (#2478)
- Fix `BACKEND_MODEL_NAME` environment always overwritten to model name specified at model definition (#2481)
- Do not allow assigning preopen port which collides with image's own service port definition (#2482)
- Fix GET requests with queryparams defined in API spec occasionally throwing 400 Bad Request error (#2483)
- Check null value of user mutation by `Undefined` sentinel value rather than `None`. (#2506)
- Do null check on `groups.total_resource_slots` and `domains.total_resource_slots` value. (#2509)
- Fix heartbeat processing failing when agent reports image with its name not compliant with Backend.AI's naming rule (#2516)
- Corrected a typo (`maanger` corrected to `manager`) in the `check_status()` API response of the storage component (#2523)
- Rename `images.image_filters` GQL Query argument to `images.image_types` (#2555)
- Prevent session status from transitioning to `PULLING` status even if image pull is not required (#2556)
- Prevent other user's customized image from being listed as a response of `images` GQL query (#2557)
- Skip resolving malformed `ModelCard` GQL item (#2570)
- Delete sessions DB records when purging project. (#2573)
- Initialize Redis connection pool objects with specified connection opts rather than ignoring them. (#2574)
- Fix `GET /func/folders/{folderName}` API returning string literal `"null"` instead of null value on `user` and `group` fields (#2584)
- Update `GQLPrivilegeCheckMiddleware` to align with upstream changes on `graphql-core` package (#2598)
- Robust type check when idle checker fetches utilization data. (#2601)
- Skip mounting zero-byte lxcfs files when lxcfs is activated to prevent crashes in session containers (#2604)
- Fix typo in minilang query field spec and column map. (#2605)
- Remove duplicate CPU quota arguments when creating containers (#2608)
- Increase `MAX_CMD_LEN` of dropbear to improve compatibility with PyCharm debugger (#2613)
- Silence false-positive Redis timeout warnings when retrying blocking commands if the timeout does not exceed the expected command timeout (#2632)
- Fix a regression of #2483 in the session-download API used by the `backend.ai ssh` command (#2635)
- Implement missing `StrEnumType` handling in `populate_fixture()`. (#2648)
- Let `GET /resource/usage/period` request contain data in query parameter rather than JSON body. (#2661)
- Allow sudo-enabled container users to overwrite `/usr/bin/scp` and `/usr/libexec/sftp-server` by unifying the intrinsic ssh binaries to use the merged `dropbearmulti` executable. (#2667)
- Update `webserver` logout API to respond with HTTP 200 OK (#2681)
- Fix WSProxy not properly handling WebSocket request sent from Firefox (#2684)
- Scan parent directory of created qtree to avoid creating quota on non-existing directory. (#2696)
- Fix `list_files`, `get_fstab_contents`, `get_performance_metric` and `shared_vfolder_info` Python SDK functions not working with `ValidationError` exception printed (#2706)
- Resolve the issue where the vfolder id does not match in `list_shared_vfolders`. (#2731)
- Handle OS Error when deleting vfolders. (#2741)
- Fix typo in Virtual-folder status update code. (#2742)
- Correct `msgpack` deserialization of `ResourceSlot`. (#2754)
- Fix regression error of `session create_from_template` command. (#2761)
- Silence `model_` namespace warnings with pydantic-based model classes (#2765)
- Change the initialization order of PackageContext to apply `target_path` correctly in the TUI installer (#2768)
- Make the regex patterns that update configuration files work correctly with multiline texts in the TUI installer (#2771)
- Omit null parameter when calling `usage-per-period` API. (#2777)
- Delete vfolder invitation and permission rows when deleting vfolders. (#2780)
- Handle container port mismatch when creating kernel. (#2786)
- Explicitly set the protected service ports depending on the resource group type and the service types (#2797)
- Correct session status determiner function. (#2803)
- Fix `endpoint_list.total_count` GQL field returning incorrect value (#2805)
- Fix `Service.create()` SDK method and `service create` CLI command not working with `UnboundLocalError` exception (#2806)
- Refresh expiration time of login session when logging in. (#2816)
- Fix `kernel_id` assignment for main kernel log retrieval (#2820)
- Use a safer TLS version (v1.2) when creating SSL sockets in the logstash handler (#2827)
- Fix wrong count of concurrent compute sessions. (#2829)
- Create kernels with correct `scaling_group` value. (#2837)
- Fix a regression in progress bar rendering of the TUI installer after upgrading the Textual library (#2867)
Documentation Updates
- Add note about installing client library with same version as server (#1976)
- Remove deprecated `version` from the docker compose YAML templates in package installation docs. (#2035)
- Fix a typo (a duplicate double quote) in the `agent.toml` example of the package-based installation guide (#2069)
External Dependency Updates
- Upgrade the base runtime (CPython) version from 3.11.6 to 3.12.2 (#1994)
- Upgrade aiodocker to v0.22.0 with minor bug fixes found by improved type annotations (#2339)
- Update the halfstack containers to point the latest stable versions (#2367)
- Upgrade aiodocker to 0.22.1 to fix error handling when trying to extract the log of non-existing containers (#2402)
- Upgrade the base CPython from 3.12.2 to 3.12.4 (#2449)
- Upgrade Python (3.12.4 -> 3.12.6) and common/tool dependencies to prepare for Python 3.13 and apply latest fixes (#2851)
Miscellaneous
- Wrap RPC authentication error to custom error for better logging. (#1970)
- Add `requested_slots` field to compute session GQL type. (#1984)
- Allow `pydantic.BaseModel` as the API handler return schema. (#1987)
- Fix incorrect version notation of GQL Field. (#1993)
- Add max_pending_session_count field to Keypair resource policy GQL schema (#2013)
- Handle container creation exception and start exception in separate try-except contexts. (#2316)
- Fix the broken workflow call for the action that auto-assigns PR numbers to news fragments (#2358)
- Finally stabilize the hanging tests in our CI due to docker-internal races on TCP port mappings to concurrently spawned fixture containers by introducing monotonically increasing TCP port numbers (#2379)
- Further improve the monotonic port allocation logic for the test containers to remove maximum concurrency restrictions (#2396)
- Add PEX, SCIE binary build configs for the plugin subsystem. (#2422)
- Add POST `/folders` API endpoints to replace DELETE APIs that require request body, and allow `DELETE` requests to have body data. (#2571)
- Enhance type hints for potential `None` arguments (#2580)
- Add `ai.backend.manager.models.graphql` module for better code base management. (#2669)
- Remove Scheduler related types that are no longer used. (#2705)
- Allow adding required GQL field argument to schema. (#2712)
- Upgrade `readthedocs` build environment to Python 3.12 (#2814)

## 24.03.0rc1 (2024-03-31)
Features
- Allow filtering `compute_session` query by `user_id`. (#1805)
- Allow overriding vfolder mount permissions in API calls and CLI commands to create new sessions, with addition of a generic parser of comma-separated "key=value" list for CLI args and API params (#1838)
- Always enable `ai.backend.accelerator.cuda_open` in the scie-based installer (#1966)
- Use `config["pipeline"]["endpoint"]` as default value of `config["pipeline"]["frontend-endpoint"]` when not provided (#1972)
- Migrate container registry config storage from `Etcd` to `PostgreSQL` (#1917)
- Implement ID-based client workflow to ContainerRegistry API. (#2615)
- Refactor Base ContainerRegistry's `scan_tag` and implement `MEDIA_TYPE_DOCKER_MANIFEST` type handling. (#2620)
- Support GitHub Container Registry. (#2621)
- Support GitLab Container Registry. (#2622)
- Support AWS ECR Public Container Registry. (#2623)
- Support AWS ECR Private Container Registry. (#2624)
- Replace rescan command's `--local` flag with local container registry record. (#2665)
- Add `project` column to the images table and refactor `ImageRef` logic. (#2707)
- Support docker image manifest v2 schema1. (#2815)
- Add `filter` and `order` parameters to Group GQL Relay API. (#2863)
- Add `vast_use_auth_token` config to utilize VASTData API token optionally. (#2901)
- Use a valid value for the `id` field in the GQL schema query resolver for `ContainerRegistry`. (#2908)
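
The vfolder mount permission entry (#1838) mentions a generic parser for comma-separated "key=value" lists. One way such a parser could look is sketched below. The function name `parse_kv_list`, its error handling, and the sample folder names are assumptions made for illustration; the actual parser may treat quoting, escaping, or duplicate keys differently.

```python
def parse_kv_list(text: str) -> dict[str, str]:
    """Parse a comma-separated "key=value" list into a dict.

    Hypothetical helper for illustration only.
    """
    result: dict[str, str] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty segments such as trailing commas
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected key=value, got {item!r}")
        result[key.strip()] = value.strip()
    return result


# Example input: per-folder permission overrides (folder names are made up).
perms = parse_kv_list("data=ro, models=rw")
```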
Fixes
- Set single agent per kernel resource usage. (#1725)
- Abort container creation when duplicate container port definition exists (#1750)
- To update image metadata, check if the min/max values in `resource_limits` are undefined. (#1941)
- Explicitly disable the user-site package detection in the krunner python commands to avoid potential conflicts with user-installed packages in `.local` directories (#1962)
- Fix `caf54fcc17ab` migration to drop a primary key only if it exists and in `589c764a18f1` migration, add missing table arguments. (#1963)
- Explicitly wait for readiness of the Docker daemon and the compose stack before pouring database fixtures in `install-dev.sh` for when installing at the provisioning stage of Codespaces and integration tests in CI. (#2378)
- Add missing implementation of wsproxy and manager CLI's log-level customization options (#2698)
- Add missing batch execution call after session starts (#2884)
- Fix a regression of the unicode-aware slug update that prevented creation of dot-prefixed (automount) vfolders (#2892)
- Fix invalid image format log spam in Agent (#2894)
- Fix wrong creation of `raw_configs` in `_create_kernels_in_one_agent` (#2896)
- Assign valid value to `id` field in `ContainerRegistryNode` GQL schema query resolver. (#2899)
- Update vast quota rather than raise error when quota exists. (#2900)
- Calculate correct expiration time of VAST auth token and add `vast_force_login` config to enable login before every REST API call (#2911)
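
The duplicate container port entry (#1750) describes aborting container creation when the same port is defined twice. A generic sketch of such a check follows. Both function names are hypothetical and the real validation in the agent may be structured differently.

```python
from collections import Counter


def find_duplicate_ports(ports: list[int]) -> list[int]:
    """Return the sorted port numbers that appear more than once."""
    counts = Counter(ports)
    return sorted(port for port, n in counts.items() if n > 1)


def validate_container_ports(ports: list[int]) -> None:
    # Abort early instead of creating a container with ambiguous port mappings.
    duplicates = find_duplicate_ports(ports)
    if duplicates:
        raise ValueError(f"duplicate container port definitions: {duplicates}")
```

Calling `validate_container_ports([8080, 22, 8080])` raises `ValueError`, while a list of unique ports passes silently.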
Documentation Updates
- Update docstrings in `ai.backend.client.request.Request:fetch()` and `ai.backend.client.request.FetchContextManager` as the support for synchronous context manager has been deprecated. (#1801)
- Resize font-size of footer text in ethical ads in documentation hosted by read-the-docs (#1965)
- Only resize font-size of footer text in ethical ads not in title of content in documentation (#1967)
Miscellaneous
- Revert response type of service create API. (#1979)
Full Changelog
Check out the full changelog until this release (24.09.0).
Full Commit Logs
Check out the full commit logs between release (24.09.0rc1) and (24.09.0).