Releases · kubernetes-sigs/kueue

09 Feb 17:11

alculquicondor

v0.5.3

f816b7f

Kueue v0.5.3

Changes since v0.5.2:

Changes by Kind

Bug or Regression

Prevent finished Workloads from blocking quota after a Kueue restart (#1699, @trasc)
Do not (re)create ProvReq if the state of admission check is Ready (#1620, @mimowo)
Fix Kueue crashing at log level 6 when re-admitting workloads (#1645, @mimowo)
Kueue replicas are advertised as Ready only once the webhooks are functional.

This lets users hold off on their first requests until the Kueue deployment is available, so that early requests don't fail. (#1682 #1713, @mimowo @trasc)
Remove deleted pending workloads from the cache (#1687, @astefanutti)

Contributors

astefanutti, mimowo, and trasc

07 Feb 21:23

alculquicondor

v0.6.0-rc.2

90fa327

Kueue v0.6.0-rc.2 (Pre-release)

Changes since v0.5.0:

Changes by Kind

API Change

Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
Extend the information returned for pending workloads in a cluster queue to include the fields used to determine the workload position, as well as the position itself. (#1362, @PBundyra)
Extend the visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in a LocalQueue. (#1365, @PBundyra)
Introduce an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
Visibility.PendingWorkload no longer implements the runtime.Object interface (#1386, @PBundyra)
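
As a rough illustration of what .waitForPodsReady.requeuingTimestamp controls, the sketch below orders workloads by either their eviction time or their creation time. The Workload class, the requeue_order function, and the "Eviction"/"Creation" option names are assumptions made for this example; they are not Kueue's actual types or code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Workload:
    # Hypothetical stand-in for a Kueue Workload; not the real API type.
    name: str
    created_at: float                   # creation time, seconds since epoch
    evicted_at: Optional[float] = None  # set when evicted for Pods not becoming ready


def requeue_order(workloads: List[Workload], requeuing_timestamp: str) -> List[str]:
    """Return workload names in the order they would be considered again.

    requeuing_timestamp is assumed to be "Eviction" or "Creation".
    """
    if requeuing_timestamp not in ("Eviction", "Creation"):
        raise ValueError(f"unknown requeuingTimestamp: {requeuing_timestamp}")

    def key(w: Workload) -> float:
        # With "Eviction", an evicted workload is sorted by when it was evicted;
        # otherwise the creation time is used.
        if requeuing_timestamp == "Eviction" and w.evicted_at is not None:
            return w.evicted_at
        return w.created_at

    return [w.name for w in sorted(workloads, key=key)]


workloads = [
    Workload("a", created_at=100.0, evicted_at=500.0),
    Workload("b", created_at=200.0),
    Workload("c", created_at=300.0),
]
print(requeue_order(workloads, "Eviction"))  # "a" moves behind "b" and "c"
print(requeue_order(workloads, "Creation"))  # "a" keeps its original place
```

In this simplified model, sorting by eviction time sends an evicted workload to the back of the queue, while sorting by creation time lets it keep its original position.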

Feature

A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)
Add HA support for the visibility API (#1554, @astefanutti)
Add MultiKueue garbage collection. (#1643, @trasc)
Add MultiKueue support for JobSet (#1606, @trasc)
Add Path location type for MultiKueue cluster KubeConfigs (#1640, @trasc)
Add Prebuilt Workload support for JobSets. (#1575, @trasc)
Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)
Add live status updates for MultiKueue jobs (#1668, @trasc)
Add prebuilt workload support for batch/job. (#1358, @trasc)
Add support for groups of plain Pods. (#1319, @achernevskii)
Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)
Allow configuring featureGates on helm charts. (#1314, @B1F030)
Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)
At log level 6, the usage of ClusterQueues and cohorts is included in logs.

The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)
Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)
Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)
Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)
Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)
Fix RBAC for visibility into LocalQueues (#1412, @PBundyra)
Support RayCluster as a queue-able workload in Kueue (#1520, @vicentefb)
Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)
Support for preemption while borrowing (#1397, @mimowo)
Support for retry of provisioning request.

When ProvisioningACC is enabled and there are existing ProvisioningRequests, they will be recreated.
This may cause job failures for some long-running jobs that were using the ProvisioningRequests. (#1351, @mimowo)
The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh, can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)
The leaderElection field in the Configuration API is now defaulted.
Leader election is now enabled by default. (#1598, @astefanutti)
The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)
The Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value from the corresponding kueue.Workload (#1404, @PBundyra)
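
The ClusterQueue validation rule listed above (when cohort is empty, borrowingLimit must be nil) can be sketched as a small check. The function name and the error message are hypothetical and only illustrate the rule; this is not Kueue's validation code.

```python
from typing import List, Optional


def validate_borrowing_limit(cohort: str, borrowing_limit: Optional[int]) -> List[str]:
    """Return a list of validation errors (empty when the settings are valid).

    Hypothetical helper: None stands in for a nil borrowingLimit.
    """
    errors = []
    if cohort == "" and borrowing_limit is not None:
        # Without a cohort there is nothing to borrow from, so a limit is rejected.
        errors.append("borrowingLimit must be nil when cohort is empty")
    return errors


print(validate_borrowing_limit("", None))       # valid: no cohort, no limit
print(validate_borrowing_limit("team-a", 10))   # valid: limit set inside a cohort
print(validate_borrowing_limit("", 10))         # invalid: limit without a cohort
```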

Documentation

Add release manifest with AllAlpha=true (#1696, @andrewsykim)
Adds documentation for RayCluster integration with Kueue (#1607, @vicentefb)

Bug or Regression

Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)
Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)
Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)
Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)
Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)
Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)
Fix Kueue crashing at log level 6 when re-admitting workloads (#1644, @mimowo)
Fix a bug in the pod integration that caused unexpected errors when the pod isn't found (#1512, @achernevskii)
Fix a bug where a workload representing a pod group was deleted soon after being marked as finished.
This affected pod groups that were preempted during their lifetime. (#1683, @mimowo)
Fix a bug where plain pods managed by Kueue would remain in a terminating condition forever. (#1342, @tenzen-y)
Fix a bug in the client-go libraries that prevented them from operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)
Fix handling of preemption within a cohort when there is no borrowingLimit. In that case, during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to a ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)
Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)
Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
Kueue replicas are advertised as Ready only once the webhooks are functional.

This lets users hold off on their first requests until the Kueue deployment is available, so that early requests don't fail. (#1676, @mimowo)
A pending workload in a StrictFIFO ClusterQueue no longer blocks borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)
Remove deleted pending workloads from the cache (#1679, @astefanutti)
Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)
Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)
Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)
Webhooks are served in non-leading replicas (#1509, @astefanutti)
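
The borrowingLimit preemption fix listed above (#1561) comes down to how a missing limit is interpreted. The sketch below is a deliberately simplified model with a single resource and made-up numbers. All function names are hypothetical and none of this is Kueue's scheduler code.

```python
import math
from typing import Optional


def max_borrowable(borrowing_limit: Optional[float]) -> float:
    """Fixed behaviour: no borrowingLimit (None) means borrowing is unlimited."""
    return math.inf if borrowing_limit is None else borrowing_limit


def max_borrowable_buggy(borrowing_limit: Optional[float]) -> float:
    """Behaviour described in the bug: a missing limit was treated as 0."""
    return 0.0 if borrowing_limit is None else borrowing_limit


def fits_by_borrowing(request: float, usage: float, nominal: float,
                      cohort_unused: float, borrowable: float) -> bool:
    """True if the request fits in the unused nominal quota plus permitted borrowing."""
    available = (nominal - usage) + min(borrowable, cohort_unused)
    return request <= available


# A ClusterQueue with nominal quota 10, current usage 8, and no borrowingLimit.
# The cohort has 4 unused units, and the incoming workload requests 5.
fixed = fits_by_borrowing(5, 8, 10, 4, max_borrowable(None))
buggy = fits_by_borrowing(5, 8, 10, 4, max_borrowable_buggy(None))
print(fixed)  # the workload fits by borrowing, so no preemption is needed
print(buggy)  # the workload appears not to fit, which would lead to preemption
```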

Other (Cleanup or Flake)

Adding a toleration to a job now updates the workload (#1304, @stuton)
Expose utility functions to set up jobframework reconcilers and webhooks (#1630, @tenzen-y)

Contributors

astefanutti, alculquicondor, and 12 other contributors

23 Jan 19:14

alculquicondor

v0.6.0-rc.1

44adc22

Kueue v0.6.0-rc.1 (Pre-release)

Changes since v0.5.0:

Changes by Kind

API Change

Add the config field .waitForPodsReady.requeuingTimestamp to allow admins to configure the timestamp used when sorting workloads that were evicted due to their Pods not becoming ready in time. (#1542, @nstogner)
Extend the information returned for pending workloads in a cluster queue to include the fields used to determine the workload position, as well as the position itself. (#1362, @PBundyra)
Extend the visibility API by adding an endpoint that allows a user to fetch information about pending workloads and their position in a LocalQueue. (#1365, @PBundyra)
Introduce an on-demand API endpoint for fetching pending workloads in a cluster queue (#1251, @PBundyra)
The OwnerReferences field in PendingWorkload's metadata is now filled with the information about the owning Job (#1378, @PBundyra)
Visibility.PendingWorkload no longer implements the runtime.Object interface (#1386, @PBundyra)

Feature

A stopPolicy field in the ClusterQueue allows holding or draining a ClusterQueue (#1299, @trasc)
Add MultiKueue support for JobSet (#1606, @trasc)
Add Prebuilt Workload support for JobSets. (#1575, @trasc)
Add events for transitions of the provisioning AdmissionCheck (#1271, @stuton)
Add prebuilt workload support for batch/job. (#1358, @trasc)
Add support for groups of plain Pods. (#1319, @achernevskii)
Add validation for clusterQueue: when cohort is empty, borrowingLimit must be nil. (#1525, @B1F030)
Allow configuring featureGates on helm charts. (#1314, @B1F030)
Allow decreasing reclaimable pods to 0 for a suspended job (#1277, @yaroslava-serdiuk)
At log level 6, the usage of ClusterQueues and cohorts is included in logs.

The status of the internal cache and queues is also logged on demand when a SIGUSR2 is sent to kueue, regardless of the log level. (#1528, @alculquicondor)
Basic implementation of MultiKueue for Job. This doesn't include support for live status updates. (#1313, @trasc)
Increase the default number of reconcilers for Pod and Workload objects to 5, each. (#1589, @alculquicondor)
Jobs preserve their position in the queue if the number of pods changes before being admitted (#1223, @yaroslava-serdiuk)
Make the image build setting CGO_ENABLED configurable (#1391, @anishasthana)
Fix RBAC for visibility into LocalQueues (#1412, @PBundyra)
Support for a mechanism to suspend a running Job without requeueing (#1252, @vicentefb)
Support for preemption while borrowing (#1397, @mimowo)
Support for retry of provisioning request.

When ProvisioningACC is enabled and there are existing ProvisioningRequests, they will be recreated.
This may cause job failures for some long-running jobs that were using the ProvisioningRequests. (#1351, @mimowo)
The image gcr.io/k8s-staging-kueue/debug:main, along with the script ./hack/dump_cache.sh, can be used to trigger a dump of the internal cache into the logs. (#1541, @alculquicondor)
The leaderElection field in the Configuration API is now defaulted.
Leader election is now enabled by default. (#1598, @astefanutti)
The priority sorting within the cohort can be disabled by setting --prioritySortingWithinCohort to false (#1406, @yaroslava-serdiuk)
The Visibility.PendingWorkload object has the metav1.CreationTimestamp field filled with the value from the corresponding kueue.Workload (#1404, @PBundyra)

Bug or Regression

Add missing RBAC on integration finalizers sub-resources (#1486, @astefanutti)
Add Mutating WebhookConfigurations for the AdmissionCheck, RayJob, and JobSet to helm charts (#1567, @B1F030)
Add Validating/Mutating WebhookConfigurations for the KubeflowJobs like PyTorchJob (#1460, @tenzen-y)
Added event for QuotaReserved and fixed event for Admitted to trigger when admission checks complete (#1436, @trasc)
Avoid recreating a Workload for a finished Job and finalize a job when the workload is declared finished. (#1383, @achernevskii)
Do not (re)create ProvReq if the state of admission check is Ready (#1617, @mimowo)
Fix a bug in the pod integration that caused unexpected errors when the pod isn't found (#1512, @achernevskii)
Fix a bug where plain pods managed by Kueue would remain in a terminating condition forever. (#1342, @tenzen-y)
Fix a bug in the client-go libraries that prevented them from operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor)
Fix handling of preemption within a cohort when there is no borrowingLimit. In that case, during preemption, the permitted resources to borrow were calculated as if borrowingLimit=0, instead of unlimited.

As a consequence, when using reclaimWithinCohort, it was possible that a workload, scheduled to a ClusterQueue with no borrowingLimit, would preempt more workloads than needed, even though it could fit by borrowing. (#1561, @mimowo)
Fix the synchronization of the admission check state based on the second provisioning request (#1585, @mimowo)
Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
A pending workload in a StrictFIFO ClusterQueue no longer blocks borrowing from other ClusterQueues (#1399, @yaroslava-serdiuk)
Remove finalizer from Workloads that are orphaned (have no owners). (#1523, @achernevskii)
Trigger an eviction for an admitted Job after an admission check changed state to Rejected. (#1562, @trasc)
Visibility endpoints return 404 code for non-existent queues (#1415, @PBundyra)
Webhooks are served in non-leading replicas (#1509, @astefanutti)

Other (Cleanup or Flake)

Adding a toleration to a job now updates the workload (#1304, @stuton)

Contributors

astefanutti, alculquicondor, and 11 other contributors

28 Nov 20:01

alculquicondor

v0.5.1

8b9b1e8

Kueue v0.5.1

Changes since v0.5.0:

Bug or Regression

Fix a bug in the client-go libraries that prevented them from operating on cluster-scoped resources like ClusterQueue and ResourceFlavor. (#1294, @tenzen-y)
Fixed fungibility policy whenCanPreempt: Preempt. The admission should happen in the flavor for which preemptions were issued. (#1332, @alculquicondor)
Fix a bug where plain pods managed by Kueue would remain in a terminating condition forever. (#1342, @tenzen-y)
Fix fungibility policy Preempt where it was not able to utilize the next flavor if preemption was not possible. (#1366, @alculquicondor, @KunWuLuan)

Contributors

alculquicondor, KunWuLuan, and tenzen-y

25 Oct 21:39

alculquicondor

v0.5.0

739ebb1

Kueue v0.5.0

Changes since v0.4.0:

Highlights

AdmissionChecks: a mechanism for internal or external components to influence whether a Workload can be admitted.
Integration with cluster-autoscaler's ProvisioningRequest via AdmissionChecks.
Information about pending workloads in a ClusterQueue status.
Metrics for resource usage of ClusterQueues and LocalQueues.
Policy to control whether to preempt or borrow before trying the next flavors.
Partial admission graduated to Beta.
Workload priority, independent from Pod priority.
New integrations:
- All Kubeflow training APIs
- Single plain Pods

Changes by Kind

Feature

A mechanism for AdmissionChecks to provide labels, annotations, tolerations and node selectors to the pod templates when starting a job (#1180, @mimowo)
A reference standalone controller that can be used to support plain Pods using taints and tolerations, which can be used in Kubernetes versions that don't support scheduling gates. (#1111, @nstogner)
Add Active condition to AdmissionChecks (#1193, @trasc)
Add optional cluster queue resource quota and usage metrics. (#982, @trasc)
Add support for AdmissionChecks, a mechanism for internal or external components to influence whether a Workload can be admitted. (#1045, @trasc)
Add support for single plain Pods. (#1072, @achernevskii)
Add support for workload Priority (#1081, @Gekko0114)
Add tolerations to ResourceFlavor. Kueue injects these tolerations to the jobs that are assigned to the flavor when admitted. (#1248, @trasc)
Added pprof endpoints for profiling (#978, @stuton)
Allow the admission of multiple workloads within one scheduling cycle while borrowing. (#1039, @trasc)
An option to synchronize batch/job.completions with parallelism in case of partial admission (#971, @trasc)
Expose cluster queue information about pending workloads (#1069, @stuton)
Expose probe configurations to helm chart (#986, @yyzxw)
Graduate Partial admission to Beta. (#1221, @trasc)
Integrate with Cluster Autoscaler's ProvisioningRequest via two stage admission (#1154, @trasc)
Manage cluster queue active state based on admission checks life cycle. (#1079, @trasc)
Metrics for usage and reservations in ClusterQueues and LocalQueues. (#1206, @trasc)
Options to allow workloads to borrow quota or preempt other workloads before trying the next flavor in the list (#849, @KunWuLuan)
Support kubeflow.org/mxjob (#1183, @tenzen-y)
Support kubeflow.org/paddlejob (#1142, @tenzen-y)
Support kubeflow.org/pytorchjob (#995, @tenzen-y)
Support kubeflow.org/tfjob (#1068, @tenzen-y)
Support kubeflow.org/xgboostjob (#1114, @tenzen-y)
Workload objects have the label kueue.x-k8s.io/job-uid where the value matches the uid of the parent job, whether that's a Job, MPIJob, RayJob, or JobSet (#1032, @achernevskii)
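
The kueue.x-k8s.io/job-uid label listed above makes it possible to find the Workload that belongs to a given parent job. A minimal sketch, assuming Workloads are available as plain dictionaries shaped like Kubernetes objects; the helper names and the sample data are hypothetical.

```python
from typing import Dict, List

JOB_UID_LABEL = "kueue.x-k8s.io/job-uid"


def job_uid_selector(job_uid: str) -> str:
    """Build an equality-based label selector string for the parent job's uid."""
    return f"{JOB_UID_LABEL}={job_uid}"


def workloads_for_job(workloads: List[Dict], job_uid: str) -> List[str]:
    """Return the names of the workloads whose job-uid label matches job_uid."""
    return [
        w["metadata"]["name"]
        for w in workloads
        if w["metadata"].get("labels", {}).get(JOB_UID_LABEL) == job_uid
    ]


# Made-up sample objects; only the fields used above are filled in.
sample = [
    {"metadata": {"name": "job-train-1a2b", "labels": {JOB_UID_LABEL: "uid-111"}}},
    {"metadata": {"name": "job-eval-3c4d", "labels": {JOB_UID_LABEL: "uid-222"}}},
    {"metadata": {"name": "unlabeled"}},
]
print(job_uid_selector("uid-111"))
print(workloads_for_job(sample, "uid-111"))
```

The selector string has the usual key=value form, so it could be passed to a label-selector option when listing Workloads.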

Bug or Regression

Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
Ensure the ClusterQueue status is updated as the number of pending workloads changes. (#1135, @mimowo)
Fix resuming of a RayJob after preemption. (#1156, @kerthcet)
Fixed missing create verb for webhook (#1035, @stuton)
Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1023, @alculquicondor)
Helm: Enable the JobSet integration by default (#1184, @tenzen-y)
Improve job controller to be resilient to API failures during preemption (#1005, @alculquicondor)
Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1024, @alculquicondor)
Terminate Kueue when there is an internal failure during setup, so that it can be retried. (#1077, @alculquicondor)

Other (Cleanup or Flake)

Add client-go library for AdmissionCheck (#1104, @tenzen-y)
Add mergeStrategy:merge to all conditions of API objects (#1089, @alculquicondor)
Update ray-operator to v0.6.0 (#1231, @lowang-bh)

Contributors

alculquicondor, nstogner, and 10 other contributors

11 Oct 20:01

alculquicondor

v0.4.2

417b060

Kueue v0.4.2

Changes since v0.4.1:

Bug or Regression

Adjust resources (based on LimitRanges, PodOverhead and resource limits) on existing Workloads when a LocalQueue is created (#1197, @alculquicondor)
Fix resuming of a RayJob after preemption. (#1190, @kerthcet)

Contributors

alculquicondor and kerthcet

15 Aug 13:40

alculquicondor

v0.4.1

328bb66

Kueue v0.4.1

Bug or Regression

Fixed missing create verb for webhook (#1053, @stuton)
Fixed scheduler to only allow one admission or preemption per cycle within a cohort that has ClusterQueues borrowing quota (#1029, @alculquicondor)
Prevent workloads in ClusterQueue with StrictFIFO from blocking higher priority workloads in other ClusterQueues in the same cohort that require preemption (#1030, @alculquicondor)

Contributors

alculquicondor and stuton

07 Jul 14:41

alculquicondor

v0.4.0

5cc79d1

Kueue v0.4.0

Changes since v0.3.0:

API Change

Report resource usage in LocalQueue. (#737, @tenzen-y)

Feature

Add client-go libraries. (#789, @tenzen-y)
Add support for Kuberay's RayJobs. (#667, @trasc)
Add support for dynamic reclaim in the JobSet integration. (#901, @trasc)
Add support for partial workload admission (#771, @trasc)
Add the support for dynamic resources reclaim. (#756, @trasc)
Allow the scheduler to admit more jobs when the head job has not reached the PodReady=true status. (#708, @KunWuLuan)
Allow specifying the manager pod and container security context instead of hardcoded values (#878, @bh-tt)
Feature gates for alpha/experimental features are introduced to the Kueue project. (#788, @kerthcet)
Ignore integrations whose CRD isn't installed; otherwise, all integrations are enabled by default (#883, @stuton)
Integrate JobSet into Kueue (#762, @mcariatm)

Bug or Regression

Add permission to update frameworkjob status. (#797, @tenzen-y)
Fix a bug where update events for ClusterQueues were created endlessly. (#907, @tenzen-y)
Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#835, @tenzen-y)
Fix panic in cluster queue if resources and coveredResources do not have the same length. (#787, @kannon92)
Fix: Enforce borrowed=0 if ClusterQueue doesn't belong to a cohort. (#759, @tenzen-y)
Fix: Potential over-admission within cohort when borrowing. (#805, @trasc)
Fixed preemption to prefer preempting workloads that were more recently admitted. (#843, @stuton)
Fixed a bug where suspend=true added to the job/mpijob by the default webhook did not take effect. (#758, @fjding)
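
The preemption ordering fix listed above (#843) can be pictured as a sort over candidate workloads. In the sketch below, the Admitted class, the function name, and the choice to consider lower priority first are assumptions for illustration. Only the "more recently admitted first" tie-break comes from the release note.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Admitted:
    # Hypothetical stand-in for an admitted workload.
    name: str
    priority: int
    admitted_at: float  # admission time, seconds since epoch


def preemption_candidates(workloads: List[Admitted]) -> List[str]:
    """Order candidates: lower priority first (assumption), then newest admission first."""
    ordered = sorted(workloads, key=lambda w: (w.priority, -w.admitted_at))
    return [w.name for w in ordered]


admitted = [
    Admitted("old", priority=0, admitted_at=100.0),
    Admitted("new", priority=0, admitted_at=300.0),
    Admitted("high", priority=5, admitted_at=400.0),
]
print(preemption_candidates(admitted))  # "new" is preferred over "old" at equal priority
```

Preferring recently admitted workloads means that, among otherwise equal candidates, the one that has done the least work is the one that gets interrupted.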

Other (Cleanup or Flake)

Add validation for child jobs without ownerReference. (#865, @tenzen-y)

Contributors

kannon92, stuton, and 7 other contributors

13 Jun 14:51

alculquicondor

v0.3.2

ff63c63

Kueue v0.3.2

Changes since v0.3.1:

Bug or Regression

Add permission to update frameworkjob status. (#798, @tenzen-y)
Fix a bug where a child batch/job of an unmanaged parent (doesn't have queue name) was being suspended. (#839, @tenzen-y)
Fix panic in cluster queue if resources and coveredResources do not have the same length. (#799, @kannon92)
Fix: Potential over-admission within cohort when borrowing. (#822, @trasc)
Fixed preemption to prefer preempting workloads that were more recently admitted. (#845, @stuton)

Contributors

kannon92, stuton, and 2 other contributors

16 May 18:55

alculquicondor

v0.3.1

50f628a

Kueue v0.3.1

Changes since v0.3.0:

Bug fixes

Fix a bug where the validation webhook didn't validate the queue name set as a label when creating an MPIJob. #711
Fix a bug that updated the queue name in workloads to an empty value when using framework jobs that use batch/job internally, such as MPIJob. #713
Fix a bug in which borrowed values are set to a non-zero value even though the ClusterQueue doesn't belong to a cohort. #761
Fixed adding suspend=true to job/mpijob by the default webhook. #765
