Release 24.09.0 · lablup/backend.ai

Features

Add support for optional payload encryption in the client SDK and CLI as a follow-up to #484 (#493)
Allow Unicode characters in project (user group) names and domain names. (#1663)
Improve exception logging stability by pre-formatting exception objects instead of pickling/unpickling them (#1759)
Add new API to create new image from live session (#1973)
Clear error_logs records in the clear-history command (#1989)
Introduce the mgr schema dump-history and mgr schema apply-missing-revisions commands to ease major upgrades involving deviation of database migration histories (#2002)
Update image forget CLI command to untag image from registry before forgetting it from the database (#2010)
Update etcd-client-py to 0.3.0 (#2014)
Allow self-ssh in single-node single-container compute sessions. (#2032)
Prevent deleting mounted folders. (#2036)
Allow agent to report its internal registry snapshot via UNIX domain socket server (#2038)
New redis client (experimental) (#2041)
Expose user info to environment variables (#2043)
Introduce the rolling_count GraphQL field to provide the current rate limit counter for a keypair within the designated time window slice (#2050)
Deprecate the reliance on HTTP cookies for authenticating the pipeline service, switching to the use of HTTP headers instead (#2051)
Allow user to explicitly set filename of model definition YAML (#2063)
Add the backend.ai plugin scan command to inspect the plugin scan results from various entrypoint sources (#2070)
Bring back etcetra-backed Etcd as an option for the distributed lock backend (#2079)
Enable distribute-lock configuration (#2080)
Cache volume objects in RootContext.get_volume (#2081)
Revamp images GQL query by changing image filtering from flag-based to feature set-based and add aliases field to customized image GQL schema (#2136)
Added missing fields for keypair_resource_policy in client-py, models, etc. (#2146)
Add parameters to check-presets SDK function (#2153)
Add relay-aware VirtualFolderNode GQL Query (#2165)
Also perform basic model service validation process when updating model service via ModifyEndpoint (#2167)
Add support for mounting arbitrary VFolders on model service session (#2168)
Add support for CentOS 8 based kernels (#2220)
Clear zombie routes automatically (#2229)
Add scaling_group.agent_count_by_status and scaling_group.agent_total_resource_slots_by_status GQL fields to query the count and the resource allocation of agents that belong to a scaling group. (#2254)
Allow modifying model service session's environment variable setup (#2255)
Add endpoint.runtime_variant column (#2256)
Add new API to show list of supported inference runtimes (#2258)
Add support for model service provisioning without model-definition.yaml (#2260)
Allow superadmins to force-update session status through destroy API. (#2275)
Add session status check & update API. (#2312)
Add support for fetching container logs of a specific kernel. (#2364)
Introduce Python native WSProxy (#2372)
Implement scanning plugin entrypoints of external packages (#2377)
Add row_id, type and container_registry fields to the GroupNode GQL schema. (#2409)
Add support for PureStorage RapidFiles Toolkit v2 (#2419)
Add API that extends lifespan of webserver's login session. (#2456)
Allow bulk association and disassociation of scaling groups with domains, user groups, and key pairs. (#2473)
Match container's timezone to container host OS when available (#2503)
Add a pre-setup configuration menu to the TUI installer to allow setting the public-facing address of Backend.AI components (#2541)
Now Backend.AI can run arbitrary container images without Backend.AI-specific metadata labels by introducing good default values and replacing intrinsic kernel-runner binaries with statically built ones (#2582)
Allow Bearer as valid token type on model service authentication (#2583)
Introduce automatic creation of a 'model-store' group upon inserting a new domain. (#2611)
Add support for declaring custom description field for GraphQL relay edge types. (#2643)
Add an enable_LLM_playground option to show/hide the LLM playground tab on the serving page. (#2677)
Add max_gaudi2_devices_per_container config on webserver (#2685)
Add max_atom_plus_device_per_container config on webserver (#2686)
Introduce Account-manager component. (#2688)
- Add query depth limit config of GQL.
- Add page size limit config of GQL Connection.
- Set default page size of GQL Connection to 10. (#2709)
Add compute session GQL Relay query schema. (#2711)
Allow DataLoaderManager to get a loader function by function itself rather than function name. (#2717)
Allow filter and order in the endpoint_list GQL request. (#2723)
Add new vfolder API to update sharing status. (#2740)
Avoid raising a type error even if a particular table in the toml file is empty, as long as the default value for all settings exists. (#2782)
Add an explicit configuration scaling-group-type to agent.toml so that the agent can distinguish whether it belongs to an SFTP resource group or not (#2796)
Add per-session priority attributes and ModifyComputeSession GraphQL mutation to update session names and priorities (#2840)
Add dependee/dependent/graph ComputeSessionNode connection queries (#2844)
Implement the priority-aware scheduler that applies to any arbitrary scheduler plugin (#2848)
Add support for setting a timeout when pulling Docker images and upgrade aiodocker to version 0.23.0. (#2852)

Improvements

Enable robust DB connection handling by allowing pool-pre-ping setting. (#1991)
Enhance update mechanism of session & kernel status. (#2311)
Remove database-level foreign key constraints in vfolders.{user,group} columns to decouple the timing of vfolder deletion and user/group deletion. (#2404)
Implement storage-host RBAC interface. (#2505)
Optimize the query latency when fetching a large number of agents with stat metrics from Redis (#2558)
Split out ai.backend.logging package from the ai.backend.common to improve reusability and reduce the startup time (i.e., import latencies) (#2760)
Avoid using collections.OrderedDict when not necessary in the manager API and client SDK (#2842)

Deprecations

Remove no longer used env-tester-{admin,user,user2}.sh scripts and all references (#1956)

Fixes

Merge kernels.role into sessions.session_type and check the image compatibility based on comparison with the ai.backend.role label (#1587)
Refactor PendingSession Scheduler into PendingSession scheduler and AgentSelector, and replace roundrobin flag with AgentSelectionStrategy.RoundRobin policy. (#1655)
Ensure that a session's occupying resources are updated in the DB when a kernel starts. (#1832)
Fix DDN command output handling when exceeding quotas. (#1901)
Explicitly specify the storage-side UID/GID when creating qtrees in the NetApp storage backend (#1983)
Sync mismatch between kernels.session_name and sessions.name and fix session-rename API to update session_name of sibling kernels atomically. (#1985)
Change function default arguments from mutable object to None. (#1986)
Revert some VFolder APIs response type to remove mismatch between Content-Type header and body. (#1988)
Upgrade pants to 2.21.0.dev4 for Python 3.12 support in their embedded pex/pip versions (#1998)
Fix Graylog log adapter not working after upgrading to Python 3.12 (#1999)
Fix compute_container GraphQL query resolver functions. (#2012)
Fix the Harbor v2 image scanner skipping the import of the remaining artifacts when any of the items does not include a tag (#2015)
Let external log viewers display more accurate, meaningful stack frames of logger invocations. (#2019)
Fix handling of undefined values in the ModifyImage GraphQL mutation. (#2028)
Fix container commit not working on certain docker engine versions (#2040)
Add the omitted request fetch from client to manager when deleting a vfolder in the trash bin. (#2042)
Fix a buggy restriction on VFolder deletion due to wrong query condition (#2055)
Fix wrong usage of dataloader in GQL group resolver. (#2056)
Ensure that vfolders, including automount vfolders, are mounted during session creation only if their status is not set to "DEAD" (i.e., deleted). (#2059)
Fix wrong calculation of resource usage (#2062)
Fix VFolder file operations not working when the user has been granted access to a shared but deleted VFolder that has the same name as a normal one (#2072)
Add missing type argument in group query (#2073)
Let the backend.ai mgr clear-history command clear session records as well as kernel records (#2077)
Fix compute_session_list GQL query not responding when there is a large number of sessions (#2084)
Fix VFolder invitation not being accepted when the invited VFolder shares its name with an already deleted one (#2093)
Fix orphan model service routes being created (#2096)
Fix initialization of the resource usage API's kernel-level usage aggregation (#2102)
Fix model server starting on every kernel (including sub-role kernels) in multi-container inference sessions (#2124)
Add missing commit_session_to_file to OP_EXC (#2127)
Fix wrong SQL query build for GQL Relay node (#2128)
Pass ImageRef.canonical in commit_session_to_file (#2134)
Handle fileset-already-exists response of create-fileset API request and make sure to wait between all GPFS job polling iterations (#2144)
Skip any possible redundant quota update requests when creating new quota (#2145)
- Fix error when calling check_presets Client SDK API with an invalid group parameter
- Rewrite Client SDK to access all APIConfig fields (#2152)
Ensure that all pending sessions are picked by schedulers (#2155)
Fix user creation error when a model-store does not exist. (#2160)
Fix buggy resolver of model_card GQL Query. (#2161)
Fix security vulnerability for sudo_session_enabled (#2162)
Rename endpoints.model_mount_destiation to model_mount_destination (#2163)
Wait for real quota scope directory creation after Netapp create_qtree() call (#2170)
Fix wrong per-user concurrency calculation logic (#2175)
Keep sync_container_lifecycles() bgtask alive in a loop. (#2178)
Fix missing check for group (project) vfolder count limit and error handling with an invalid group parameter (#2190)
Fix model service persisting in degraded status forever in rare cases when trying to delete the service (#2191)
Fix error when querying or mutating GraphQL using the BigInt field type (#2203)
Ensure that utilization idleness is checked after a set period. (#2205)
Fix backend.ai ssh command execution when packaged as SCIE/PEX (#2226)
- Fix endpoints query not working when trying to load image_row.aliases
- Fix endpoints.status reporting PROVISIONING when its status is in DESTROYING state (#2233)
Fix GQL raising error when trying to resolve endpoints.errors field occasionally (#2236)
Fix ZeroDivisionError in volume usage calculation by returning 0% when volume capacity is zero (#2245)
Fix GraphQL to support querying non-installed images (#2250)
Add missing push_image method implementation to Dummy Agent (#2253)
Rename no-op access_key parameter of endpoint_list GQL Query to user_uuid (#2287)
Fix ai.backend.service-ports label syntax broken when image does not expose built-in service port (#2288)
Improve stability of untag_image_from_registry mutation (#2289)
Fix SSH not working between kernels started with a customized image (#2290)
Fix invalid container memory capacity being reported (#2291)
Corrected an issue where the resource_policy field in the user model was incorrectly mapped to domain_name. (#2314)
Skip cleaning up containerless kernels that are still creating their containers. (#2317)
Fix model service sessions created before 24.03.5 failing to spawn (#2318)
Fix image commit not working (#2319)
Fix model service session scheduler (scale_services()) failing when sessions bound to an active route are already marked as terminated (#2320)
Fix container metric collection halted on systems with Cgroups v1 (#2321)
Run batch execution after the batch session starts. (#2327)
Add support for configuring sync_container_lifecycles() task. (#2338)
Fix mismatches between responses of /services/_runtimes and new model service creation input (#2371)
Fix incorrect check of values returned from docker stat API. (#2389)
Shut down the agent properly by removing code that waits for a cancelled task. (#2392)
Restrict GraphQL query to user_nodes field to require superadmin privilege (#2401)
Handle all possible exceptions when scheduling single node session so that the status information of pending session is not empty. (#2411)
Utilize ExtendedJSONEncoder for error logging to handle UUID objects in extra_data (#2415)
Change outdated references in event module from kernels to sessions. (#2421)
Upgrade inquirer to remove the dependency on deprecated distutils, which breaks execution of the scie builds (#2424)
Allow specific status of vfolders to query to purge. (#2429)
Update the install-dev scripts to use pnpm instead of npm to speed up installation and resolve some peculiar version resolution issues related to esbuild. (#2436)
Fix a packaging issue in the backendai-webserver scie executable due to missing explicit requirement of setuptools (#2454)
Improve pruning of non-physical filesystems when measuring disk usage in agents (#2460)
Update the install-dev scripts to install pnpm if pnpm isn't installed. (#2472)
Improve error handling of initialization failures in the kernel runner (#2478)
Fix BACKEND_MODEL_NAME environment variable always being overwritten with the model name specified in the model definition (#2481)
Do not allow assigning preopen port which collides with image's own service port definition (#2482)
Fix GET requests with queryparams defined in API spec occasionally throwing 400 Bad Request error (#2483)
Check null value of user mutation by Undefined sentinel value rather than None. (#2506)
Do null check on groups.total_resource_slots and domains.total_resource_slots value. (#2509)
Fix heartbeat processing failing when an agent reports an image whose name is not compliant with Backend.AI's naming rule (#2516)
Corrected a typo (maanger corrected to manager) in the check_status() API response of the storage component (#2523)
Rename images.image_filters GQL Query argument to images.image_types (#2555)
Prevent the session status from transitioning to PULLING if an image pull is not required (#2556)
Prevent other user's customized image from being listed as a response of images GQL query (#2557)
Skip resolving malformed ModelCard GQL items (#2570)
Delete sessions DB records when purging project. (#2573)
Initialize Redis connection pool objects with specified connection opts rather than ignoring them. (#2574)
Fix GET /func/folders/{folderName} API returning string literal "null" instead of null value on user and group fields (#2584)
Update GQLPrivilegeCheckMiddleware to align with upstream changes on graphql-core package (#2598)
Robust type check when idle checker fetches utilization data. (#2601)
Skip mounting zero-byte lxcfs files when lxcfs is activated to prevent crashes in session containers (#2604)
Fix typo in minilang query field spec and column map. (#2605)
Remove duplicate CPU quota arguments when creating containers (#2608)
Increase MAX_CMD_LEN of dropbear to improve compatibility with PyCharm debugger (#2613)
Silence falsy Redis timeout warnings when retrying blocking commands if the timeout does not exceed the expected command timeout (#2632)
Fix a regression of #2483 in the session-download API used by the backend.ai ssh command (#2635)
Implement missing StrEnumType handling in populate_fixture(). (#2648)
Let GET /resource/usage/period request contain data in query parameter rather than JSON body. (#2661)
Allow sudo-enabled container users to overwrite /usr/bin/scp and /usr/libexec/sftp-server by unifying the intrinsic ssh binaries to use the merged dropbearmulti executable. (#2667)
Update webserver logout API to respond with HTTP 200 OK (#2681)
Fix WSProxy not properly handling WebSocket request sent from Firefox (#2684)
Scan parent directory of created qtree to avoid creating quota on non-existing directory. (#2696)
Fix the list_files, get_fstab_contents, get_performance_metric and shared_vfolder_info Python SDK functions not working and printing a ValidationError exception (#2706)
Resolve the issue where the vfolder id does not match in list_shared_vfolders. (#2731)
Handle OS Error when deleting vfolders. (#2741)
Fix typo in Virtual-folder status update code. (#2742)
Correct msgpack deserialization of ResourceSlot. (#2754)
Fix regression error of session create_from_template command. (#2761)
Silence model_ namespace warnings with pydantic-based model classes (#2765)
Change the initialization order of PackageContext to apply target_path correctly in the TUI installer (#2768)
Make the regex patterns that update configuration files work correctly with multiline texts in the TUI installer (#2771)
Omit null parameters when calling the usage-per-period API. (#2777)
Delete vfolder invitation and permission rows when deleting vfolders. (#2780)
Handle container port mismatch when creating kernel. (#2786)
Explicitly set the protected service ports depending on the resource group type and the service types (#2797)
Correct session status determiner function. (#2803)
Fix endpoint_list.total_count GQL field returning incorrect value (#2805)
Fix Service.create() SDK method and service create CLI command not working with UnboundLocalError exception (#2806)
Refresh the expiration time of the login session when logging in. (#2816)
Fix kernel_id assignment for main kernel log retrieval (#2820)
Use a safer TLS version (v1.2) when creating SSL sockets in the logstash handler (#2827)
Fix wrong count of concurrent compute sessions. (#2829)
Create kernels with correct scaling_group value. (#2837)
Fix a regression in progress bar rendering of the TUI installer after upgrading the Textual library (#2867)
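
To illustrate the kind of guard described in the volume usage fix above (#2245), here is a minimal sketch. The function name `usage_percent` and its parameters are hypothetical and are not taken from the Backend.AI code base; the sketch only shows the general pattern of returning 0% when the capacity is zero instead of dividing.

```python
def usage_percent(used_bytes: int, capacity_bytes: int) -> float:
    """Return the used share of a volume as a percentage.

    Hypothetical helper for illustration only; not the project's actual code.
    """
    if capacity_bytes == 0:
        # A zero-capacity volume would otherwise raise ZeroDivisionError,
        # so report 0% usage instead.
        return 0.0
    return used_bytes / capacity_bytes * 100
```

Returning 0% keeps callers that aggregate usage over many volumes from failing on a single volume that reports no capacity.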

Documentation Updates

Add note about installing client library with same version as server (#1976)
Remove deprecated version from the docker compose YAML templates in package installation docs. (#2035)
Fix a typo (a duplicate double quote) in the agent.toml example of the package-based installation guide (#2069)

External Dependency Updates

Upgrade the base runtime (CPython) version from 3.11.6 to 3.12.2 (#1994)
Upgrade aiodocker to v0.22.0 with minor bug fixes found by improved type annotations (#2339)
Update the halfstack containers to point to the latest stable versions (#2367)
Upgrade aiodocker to 0.22.1 to fix error handling when trying to extract the log of non-existing containers (#2402)
Upgrade the base CPython from 3.12.2 to 3.12.4 (#2449)
Upgrade Python (3.12.4 -> 3.12.6) and common/tool dependencies to prepare for Python 3.13 and apply latest fixes (#2851)

Miscellaneous

Wrap RPC authentication error to custom error for better logging. (#1970)
Add requested_slots field to compute session GQL type. (#1984)
Allow pydantic.BaseModel as the API handler return schema. (#1987)
Fix incorrect version notation of GQL Field. (#1993)
Add max_pending_session_count field to Keypair resource policy GQL schema (#2013)
Handle container creation exception and start exception in separate try-except contexts. (#2316)
Fix the broken workflow call for the action that auto-assigns PR numbers to news fragments (#2358)
Finally stabilize the hanging tests in our CI due to docker-internal races on TCP port mappings to concurrently spawned fixture containers by introducing monotonically increasing TCP port numbers (#2379)
Further improve the monotonic port allocation logic for the test containers to remove maximum concurrency restrictions (#2396)
Add PEX, SCIE binary build configs for the plugin subsystem. (#2422)
- Add POST /folders API endpoints to replace DELETE APIs that require request body.
- Allow DELETE requests to have body data. (#2571)
Enhance type hints for potential None arguments (#2580)
Add ai.backend.manager.models.graphql module for better code base management. (#2669)
Remove Scheduler related types that are no longer used. (#2705)
Allow adding required GQL field argument to schema. (#2712)
Upgrade readthedocs build environment to Python 3.12 (#2814)

24.03.0rc1 (2024-03-31)

Features

Allow filtering the compute_session query by user_id. (#1805)
Allow overriding vfolder mount permissions in API calls and CLI commands to create new sessions, with addition of a generic parser of comma-separated "key=value" list for CLI args and API params (#1838)
Always enable ai.backend.accelerator.cuda_open in the scie-based installer (#1966)
Use config["pipeline"]["endpoint"] as default value of config["pipeline"]["frontend-endpoint"] when not provided (#1972)
Migrate container registry config storage from Etcd to PostgreSQL (#1917)
Implement ID-based client workflow to ContainerRegistry API. (#2615)
Refactor Base ContainerRegistry's scan_tag and implement MEDIA_TYPE_DOCKER_MANIFEST type handling. (#2620)
Support GitHub Container Registry. (#2621)
Support GitLab Container Registry. (#2622)
Support AWS ECR Public Container Registry. (#2623)
Support AWS ECR Private Container Registry. (#2624)
Replace rescan command's --local flag with local container registry record. (#2665)
Add project column to the images table and refactor ImageRef logic. (#2707)
Support docker image manifest v2 schema1. (#2815)
Add filter and order parameters to Group GQL Relay API. (#2863)
Add vast_use_auth_token config to utilize VASTData API token optionally. (#2901)
Use a valid value for the id field in the GQL schema query resolver for ContainerRegistry. (#2908)
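
The generic parser of comma-separated "key=value" lists mentioned above (#1838) could look roughly like the following sketch. The function name `parse_kv_list`, the whitespace handling, and the error behavior are assumptions made for illustration; the actual Backend.AI parser and the exact argument format it accepts may differ.

```python
def parse_kv_list(text: str) -> dict[str, str]:
    """Parse a string such as "data=ro,models=rw" into a dict.

    Hypothetical sketch; not the parser shipped with Backend.AI.
    """
    result: dict[str, str] = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            # Ignore empty items produced by "", "a=1," or "a=1,,b=2".
            continue
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key] = value.strip()
    return result
```

In this sketch, `parse_kv_list("data=ro, models=rw")` yields a mapping from each name to its value, which a caller could then interpret as per-vfolder mount permissions.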

Fixes

Set single agent per kernel resource usage. (#1725)
Abort container creation when duplicate container port definition exists (#1750)
To update image metadata, check if the min/max values in resource_limits are undefined. (#1941)
Explicitly disable the user-site package detection in the krunner python commands to avoid potential conflicts with user-installed packages in .local directories (#1962)
Fix caf54fcc17ab migration to drop a primary key only if it exists and in 589c764a18f1 migration, add missing table arguments. (#1963)
Explicitly wait for readiness of the Docker daemon and the compose stack before pouring database fixtures in install-dev.sh when installing at the provisioning stage of Codespaces and integration tests in CI. (#2378)
Add missing implementation of wsproxy and manager CLI's log-level customization options (#2698)
Add missing batch execution call after session starts (#2884)
Fix a regression of the unicode-aware slug update that prevented creation of dot-prefixed (automount) vfolders (#2892)
Fix invalid image format log spam in Agent (#2894)
Fix wrong creation of raw_configs in _create_kernels_in_one_agent (#2896)
Assign valid value to id field in ContainerRegistryNode GQL schema query resolver. (#2899)
Update vast quota rather than raise error when quota exists. (#2900)
Calculate correct expiration time of VAST auth token and add vast_force_login config to enable login before every REST API call (#2911)

Documentation Updates

Update docstrings in ai.backend.client.request.Request:fetch() and ai.backend.client.request.FetchContextManager as the support for synchronous context manager has been deprecated. (#1801)
Resize font-size of footer text in ethical ads in documentation hosted by read-the-docs (#1965)
Only resize font-size of footer text in ethical ads not in title of content in documentation (#1967)

Miscellaneous

Revert response type of service create API. (#1979)

Full Changelog

Check out the full changelog until this release (24.09.0).

Full Commit Logs

Check out the full commit logs between release (24.09.0rc1) and (24.09.0).
