diff --git a/CHANGELOG.md b/CHANGELOG.md index 84f7b62348..04ed39da83 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -1,29 +1,143 @@ # Changelog - -## [v1.7.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.7.0-rc.0) (2023-07-07) +# [v1.8.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.8.0-rc.0) (2024-04-28) + +## Breaking Changes + +- Support K8s v1.29 and Drop K8s v1.26 ([#2039](https://github.com/kubeflow/training-operator/pull/2039) by [@tenzen-y](https://github.com/tenzen-y)) +- Support K8s v1.28 and Drop K8s v1.25 ([#2038](https://github.com/kubeflow/training-operator/pull/2038) by [@tenzen-y](https://github.com/tenzen-y)) +- Deprecation Notice for MXJob ([#2058](https://github.com/kubeflow/training-operator/pull/2058) by [@tenzen-y](https://github.com/tenzen-y)) +- ⚠️ Breaking Changes: Rename `monitoring-port` flag to `webhook-server-port` ([#1925](https://github.com/kubeflow/training-operator/pull/1925) by [@afritzler](https://github.com/afritzler)) + +## New Features + +### LLM Fine-Tuning API + +- Train/Fine-tune API Proposal for LLMs ([#1945](https://github.com/kubeflow/training-operator/pull/1945) by [@deepanker13](https://github.com/deepanker13)) +- [SDK] Train API for LLM Fine-Tuning ([#1962](https://github.com/kubeflow/training-operator/pull/1962) by [@deepanker13](https://github.com/deepanker13)) +- Modify LLM Trainer to support BERT and Tiny LLaMA ([#2031](https://github.com/kubeflow/training-operator/pull/2031) by [@andreyvelich](https://github.com/andreyvelich)) +- Support arm64 for Hugging Face trainer ([#2028](https://github.com/kubeflow/training-operator/pull/2028) by [@tariq-hasan](https://github.com/tariq-hasan)) +- Add Fine-Tune BERT LLM Example ([#2021](https://github.com/kubeflow/training-operator/pull/2021) by [@andreyvelich](https://github.com/andreyvelich)) +- Train api dataset download changes ([#1959](https://github.com/kubeflow/training-operator/pull/1959) by [@deepanker13](https://github.com/deepanker13)) 
+- Train api init container creation ([#1958](https://github.com/kubeflow/training-operator/pull/1958) by [@deepanker13](https://github.com/deepanker13)) +- [SDK] Add docstring for Train API ([#2075](https://github.com/kubeflow/training-operator/pull/2075) by [@andreyvelich](https://github.com/andreyvelich)) + +### Control Plane Updates + +- Upgrade scheduler-plugins to v0.28.9 ([#2065](https://github.com/kubeflow/training-operator/pull/2065) by [@tenzen-y](https://github.com/tenzen-y)) +- Implement webhook validations for the PaddleJob ([#2057](https://github.com/kubeflow/training-operator/pull/2057) by [@tenzen-y](https://github.com/tenzen-y)) +- Implement webhook validations for the XGBoostJob ([#2052](https://github.com/kubeflow/training-operator/pull/2052) by [@tenzen-y](https://github.com/tenzen-y)) +- Implement webhook validation for the TFJob ([#2051](https://github.com/kubeflow/training-operator/pull/2051) by [@tenzen-y](https://github.com/tenzen-y)) +- Implement webhook validations for the PyTorchJob ([#2035](https://github.com/kubeflow/training-operator/pull/2035) by [@tenzen-y](https://github.com/tenzen-y)) +- Upgrade PyTorchJob examples to PyTorch v2 ([#2024](https://github.com/kubeflow/training-operator/pull/2024) by [@champon1020](https://github.com/champon1020)) +- Upgrade Go version to v1.22 ([#2046](https://github.com/kubeflow/training-operator/pull/2046) by [@tenzen-y](https://github.com/tenzen-y)) + +### SDK Improvements + +- [SDK] Add resources per worker for Create Job API ([#1990](https://github.com/kubeflow/training-operator/pull/1990) by [@andreyvelich](https://github.com/andreyvelich)) +- [SDK] Fix Worker and Master templates for PyTorchJob ([#1988](https://github.com/kubeflow/training-operator/pull/1988) by [@andreyvelich](https://github.com/andreyvelich)) +- [SDK] Get Kubernetes Events for Job ([#1975](https://github.com/kubeflow/training-operator/pull/1975) by [@andreyvelich](https://github.com/andreyvelich)) +- SDK: Upgrade the minimum 
required Kubernetes version to v1.27.2 ([#2066](https://github.com/kubeflow/training-operator/pull/2066) by [@tenzen-y](https://github.com/tenzen-y)) +- [SDK] Add information about TrainingClient logging ([#1973](https://github.com/kubeflow/training-operator/pull/1973) by [@andreyvelich](https://github.com/andreyvelich)) +- Training operator SDK unit test ([#1938](https://github.com/kubeflow/training-operator/pull/1938) by [@deepanker13](https://github.com/deepanker13)) +- [SDK] Consolidate Naming for CRUD APIs ([#1907](https://github.com/kubeflow/training-operator/pull/1907) by [@andreyvelich](https://github.com/andreyvelich)) + +## Bug Fixes + +- Fix import for HuggingFace Dataset Provider ([#2085](https://github.com/kubeflow/training-operator/pull/2085) by [@andreyvelich](https://github.com/andreyvelich)) +- Updated examples for train API ([#2077](https://github.com/kubeflow/training-operator/pull/2077) by [@shruti2522](https://github.com/shruti2522)) +- Fail job for non-retryable exit codes ([#2071](https://github.com/kubeflow/training-operator/pull/2071) by [@kellyaa](https://github.com/kellyaa)) +- E2E: Replace outdated images with latest ones ([#2083](https://github.com/kubeflow/training-operator/pull/2083) by [@tenzen-y](https://github.com/tenzen-y)) +- fix wrong filepath in the simple example command ([#2062](https://github.com/kubeflow/training-operator/pull/2062) by [@qzoscar](https://github.com/qzoscar)) +- fix(example): add installation of python-etcd in Pytorch example ([#2064](https://github.com/kubeflow/training-operator/pull/2064) by [@champon1020](https://github.com/champon1020)) +- fix: Upgrade controller-gen to v0.14.0 ([#2026](https://github.com/kubeflow/training-operator/pull/2026) by [@champon1020](https://github.com/champon1020)) +- Fix build workflow config for pytorch-torchrun-example ([#2020](https://github.com/kubeflow/training-operator/pull/2020) by [@PeterWrighten](https://github.com/PeterWrighten)) +- Fix Distributed Data Samplers in 
PyTorch Examples ([#2012](https://github.com/kubeflow/training-operator/pull/2012) by [@andreyvelich](https://github.com/andreyvelich)) +- Fix URL in python SDK setup.py ([#2011](https://github.com/kubeflow/training-operator/pull/2011) by [@garymm](https://github.com/garymm)) +- Fix for Github CI to publish HF trainer image ([#1987](https://github.com/kubeflow/training-operator/pull/1987) by [@johnugeorge](https://github.com/johnugeorge)) +- train api jupyternotebook fix ([#1984](https://github.com/kubeflow/training-operator/pull/1984) by [@deepanker13](https://github.com/deepanker13)) +- fix: volcano podgroup should has a non-empty queue name ([#1977](https://github.com/kubeflow/training-operator/pull/1977) by [@lowang-bh](https://github.com/lowang-bh)) +- Fix Master Label for PyTorchJob ([#1974](https://github.com/kubeflow/training-operator/pull/1974) by [@andreyvelich](https://github.com/andreyvelich)) +- IsMasterRole fix in pytorchjob controller ([#1969](https://github.com/kubeflow/training-operator/pull/1969) by [@deepanker13](https://github.com/deepanker13)) +- [fix] replace ${go env GOPATH} with $(go env GOPATH) to get the prope… ([#1952](https://github.com/kubeflow/training-operator/pull/1952) by [@double12gzh](https://github.com/double12gzh)) +- Fixing issues with providing existing service account ([#1918](https://github.com/kubeflow/training-operator/pull/1918) by [@rpemsel](https://github.com/rpemsel)) + +## Misc + +- Update training operator image to latest ([#2089](https://github.com/kubeflow/training-operator/pull/2089) by [@johnugeorge](https://github.com/johnugeorge)) +- Update sdk to v1.8.0rc0 ([#2087](https://github.com/kubeflow/training-operator/pull/2087) by [@johnugeorge](https://github.com/johnugeorge)) +- Test: Simplify and Identify pod-controller envtest ([#2084](https://github.com/kubeflow/training-operator/pull/2084) by [@tenzen-y](https://github.com/tenzen-y)) +- Remove deadcode related to PodDisruptionBudget 
([#2073](https://github.com/kubeflow/training-operator/pull/2073) by [@tenzen-y](https://github.com/tenzen-y)) +- docs: updating docs for local development ([#2074](https://github.com/kubeflow/training-operator/pull/2074) by [@franciscojavierarceo](https://github.com/franciscojavierarceo)) +- PyTorchJob: Always show warnings when using elasticPolicy.nProcPerNode ([#2067](https://github.com/kubeflow/training-operator/pull/2067) by [@tenzen-y](https://github.com/tenzen-y)) +- Updated developer docs to include Kind ([#2061](https://github.com/kubeflow/training-operator/pull/2061) by [@franciscojavierarceo](https://github.com/franciscojavierarceo)) +- adding fine tune example with s3 as the dataset store ([#2006](https://github.com/kubeflow/training-operator/pull/2006) by [@deepanker13](https://github.com/deepanker13)) +- CI: Use a mode=min in the builder cache ([#2053](https://github.com/kubeflow/training-operator/pull/2053) by [@tenzen-y](https://github.com/tenzen-y)) +- Fix: upgrade version of crd-ref-docs, which caused panic with go v1.22 ([#2043](https://github.com/kubeflow/training-operator/pull/2043) by [@jdcfd](https://github.com/jdcfd)) +- Remove Dockerfile.ppc64le of pytorch example ([#2042](https://github.com/kubeflow/training-operator/pull/2042) by [@champon1020](https://github.com/champon1020)) +- publish torchrun example via Dockerfile ([#2018](https://github.com/kubeflow/training-operator/pull/2018) by [@PeterWrighten](https://github.com/PeterWrighten)) +- Updated examples/pytorch to disable istio sidecar injection ([#2004](https://github.com/kubeflow/training-operator/pull/2004) by [@jdcfd](https://github.com/jdcfd)) +- [docs] development guide update ([#1995](https://github.com/kubeflow/training-operator/pull/1995) by [@shashank-iitbhu](https://github.com/shashank-iitbhu)) +- Add Kubeflow Website links to README ([#1983](https://github.com/kubeflow/training-operator/pull/1983) by [@andreyvelich](https://github.com/andreyvelich)) +- publish trainer 
hugging face image ([#1985](https://github.com/kubeflow/training-operator/pull/1985) by [@deepanker13](https://github.com/deepanker13)) +- Adding Training image needed for train api ([#1963](https://github.com/kubeflow/training-operator/pull/1963) by [@deepanker13](https://github.com/deepanker13)) +- Add test to create PyTorchJob from func ([#1979](https://github.com/kubeflow/training-operator/pull/1979) by [@andreyvelich](https://github.com/andreyvelich)) +- Corrected Some Spelling And Grammatical Errors ([#1980](https://github.com/kubeflow/training-operator/pull/1980) by [@daniel-hutao](https://github.com/daniel-hutao)) +- torchrun example with cpu version pytorch ([#1965](https://github.com/kubeflow/training-operator/pull/1965) by [@kuizhiqing](https://github.com/kuizhiqing)) +- utils changes needed to add train api ([#1954](https://github.com/kubeflow/training-operator/pull/1954) by [@deepanker13](https://github.com/deepanker13)) +- Adding parallel support for coveralls ([#1956](https://github.com/kubeflow/training-operator/pull/1956) by [@johnugeorge](https://github.com/johnugeorge)) +- chore: pkg import only once ([#1950](https://github.com/kubeflow/training-operator/pull/1950) by [@testwill](https://github.com/testwill)) +- fix nproc env in elastic mode for pytorchjob ([#1948](https://github.com/kubeflow/training-operator/pull/1948) by [@kuizhiqing](https://github.com/kuizhiqing)) +- Avoid modifying log level globally ([#1944](https://github.com/kubeflow/training-operator/pull/1944) by [@droctothorpe](https://github.com/droctothorpe)) +- Add @andreyvelich to Approvers ([#1941](https://github.com/kubeflow/training-operator/pull/1941) by [@andreyvelich](https://github.com/andreyvelich)) +- Merge v1.7 branch changes to Main ([#1940](https://github.com/kubeflow/training-operator/pull/1940) by [@johnugeorge](https://github.com/johnugeorge)) +- Increase the root volume size on the github runner when building container images 
([#1931](https://github.com/kubeflow/training-operator/pull/1931) by [@tenzen-y](https://github.com/tenzen-y)) +- Check podGroup CRD for the volcano and the scheudler-plugins as default. ([#1929](https://github.com/kubeflow/training-operator/pull/1929) by [@Syulin7](https://github.com/Syulin7)) +- Use a community hosted image in MXJob E2E ([#1928](https://github.com/kubeflow/training-operator/pull/1928) by [@tenzen-y](https://github.com/tenzen-y)) +- Build MXJob examples in CI ([#1927](https://github.com/kubeflow/training-operator/pull/1927) by [@tenzen-y](https://github.com/tenzen-y)) +- Bump `k8s.io/*` deps to 1.28 ([#1920](https://github.com/kubeflow/training-operator/pull/1920) by [@afritzler](https://github.com/afritzler)) +- Replace XGBoost image for E2E with community hosted ([#1922](https://github.com/kubeflow/training-operator/pull/1922) by [@tenzen-y](https://github.com/tenzen-y)) +- Creating service account where approriate for MPI Job ([#1917](https://github.com/kubeflow/training-operator/pull/1917) by [@rpemsel](https://github.com/rpemsel)) +- Build XGBoostJob example images in CI ([#1913](https://github.com/kubeflow/training-operator/pull/1913) by [@tenzen-y](https://github.com/tenzen-y)) +- Manage kube-delivery image from training-operator and update it ([#1909](https://github.com/kubeflow/training-operator/pull/1909) by [@rpemsel](https://github.com/rpemsel)) +- Adding Yuki to Approvers ([#1901](https://github.com/kubeflow/training-operator/pull/1901) by [@johnugeorge](https://github.com/johnugeorge)) +- docs: Remove reference to tf-operator specific design doc ([#1903](https://github.com/kubeflow/training-operator/pull/1903) by [@terrytangyuan](https://github.com/terrytangyuan)) +- Add Training WG Community Call ([#1900](https://github.com/kubeflow/training-operator/pull/1900) by [@andreyvelich](https://github.com/andreyvelich)) +- update full change list in changelog ([#1895](https://github.com/kubeflow/training-operator/pull/1895) by 
[@lowang-bh](https://github.com/lowang-bh)) +- update volcano scheduler to 1.8.0 ([#1894](https://github.com/kubeflow/training-operator/pull/1894) by [@lowang-bh](https://github.com/lowang-bh)) +- Changelog updated for 1.7.0 rc0 release ([#1892](https://github.com/kubeflow/training-operator/pull/1892) by [@johnugeorge](https://github.com/johnugeorge)) +- Add Stale GitHub Action ([#1893](https://github.com/kubeflow/training-operator/pull/1893) by [@andreyvelich](https://github.com/andreyvelich)) +- Refactor core/pod tests ([#1890](https://github.com/kubeflow/training-operator/pull/1890) by [@tenzen-y](https://github.com/tenzen-y)) +- Remove klog v1 ([#1886](https://github.com/kubeflow/training-operator/pull/1886) by [@tenzen-y](https://github.com/tenzen-y)) + +[Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.7.0...v1.8.0-rc.0) + +# [v1.7.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.7.0-rc.0) (2023-07-07) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.6.0...v1.7.0-rc.0) -**Breaking Changes** +## Breaking Changes + - Upgrade Scheduler Plugins version to v0.25.7 https://github.com/kubeflow/training-operator/pull/1824 ([tenzen-y](https://github.com/tenzen-y)) - Upgrade the kubernetes dependencies to v1.27 https://github.com/kubeflow/training-operator/pull/1834 ([tenzen-y](https://github.com/tenzen-y)) -**New features** +## New Features + - Make scheduler-plugins the default gang scheduler. 
[\#1747](https://github.com/kubeflow/training-operator/pull/1747) ([Syulin7](https://github.com/Syulin7)) - Merge kubeflow/common to training-operator [\#1813](https://github.com/kubeflow/training-operator/pull/1813) ([johnugeorge](https://github.com/johnugeorge)) -- Auto-generate RBAC manifests by the controller-gen [\#1815](https://github.com/kubeflow/training-operator/pull/1815) ([Syulin7](https://github.com/Syulin7)) +- Auto-generate RBAC manifests by the controller-gen [\#1815](https://github.com/kubeflow/training-operator/pull/1815) ([Syulin7](https://github.com/Syulin7)) - Implement suspend semantics [\#1859](https://github.com/kubeflow/training-operator/pull/1859) ([tenzen-y](https://github.com/tenzen-y)) - Set up controllers using goroutines to start the manager quickly [\#1869](https://github.com/kubeflow/training-operator/pull/1869) ([tenzen-y](https://github.com/tenzen-y)) - Set correct ENV for PytorchJob to support torchrun [\#1840](https://github.com/kubeflow/training-operator/pull/1840) ([kuizhiqing](https://github.com/kuizhiqing)) -**Bug fixes** +## Bug Fixes + - Fix a bug that XGBoostJob's running condition isn't updated when the job is resumed [\#1866](https://github.com/kubeflow/training-operator/pull/1866) ([tenzen-y](https://github.com/tenzen-y)) - Set a Running condition when the XGBoostJob is completed and doesn't have a Running condition [\#1789](https://github.com/kubeflow/training-operator/pull/1789) ([tenzen-y](https://github.com/tenzen-y)) - Avoid to depend on local env when installing the code-generators [\#1810](https://github.com/kubeflow/training-operator/pull/1810) ([tenzen-y](https://github.com/tenzen-y)) +## Misc -**Misc** - Removing reconciler code [\#1879](https://github.com/kubeflow/training-operator/pull/1879) ([johnugeorge](https://github.com/johnugeorge)) - Make Condition and ReplicaStatus optional [\#1862](https://github.com/kubeflow/training-operator/pull/1862) ([tenzen-y](https://github.com/tenzen-y)) - Use the same 
reasons for Condition and Event [\#1854](https://github.com/kubeflow/training-operator/pull/1854) ([tenzen-y](https://github.com/tenzen-y)) @@ -39,16 +153,16 @@ - xgb yaml container name should be consistent with xgb job default container name [\#1794](https://github.com/kubeflow/training-operator/pull/1794) ([Crisescode](https://github.com/Crisescode)) - make timeout configurable from e2e tests [\#1787](https://github.com/kubeflow/training-operator/pull/1787) ([nagar-ajay](https://github.com/nagar-ajay)) +# [v1.6.0](https://github.com/kubeflow/training-operator/tree/v1.6.0) (2023-03-21) -## [v1.6.0](https://github.com/kubeflow/training-operator/tree/v1.6.0) (2023-03-21) - -Note: Since scheduler-plugins has changed API from `sigs.k8s.io` with the `x-k8s.io`, future releases of training operator(v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: [\#1769](https://github.com/kubeflow/training-operator/issues/1769) +Note: Since scheduler-plugins has changed API from `sigs.k8s.io` with the `x-k8s.io`, future releases of training operator(v1.7+) will not support scheduler-plugins v0.24.x or lower. Related: [\#1769](https://github.com/kubeflow/training-operator/issues/1769) Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training/1.6.0/) does not support earlier training operator versions. The minimum training operator version required is v1.6.0 release. 
Related: [\#1702](https://github.com/kubeflow/training-operator/pull/1702) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.5.0...v1.6.0) -**New Features** +## New Features + - Support for k8s v1.25 in CI [\#1684](https://github.com/kubeflow/training-operator/pull/1684) ([johnugeorge](https://github.com/johnugeorge)) - HPA support for PyTorch Elastic [\#1701](https://github.com/kubeflow/training-operator/pull/1701) ([johnugeorge](https://github.com/johnugeorge)) - Adopting coschduling plugin [\#1724](https://github.com/kubeflow/training-operator/pull/1724) ([tenzen-y](https://github.com/tenzen-y)) @@ -57,9 +171,9 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - \[SDK\] Use Training Client without Kube Config [\#1740](https://github.com/kubeflow/training-operator/pull/1740) ([andreyvelich](https://github.com/andreyvelich)) - \[SDK\] Create Unify Training Client [\#1719](https://github.com/kubeflow/training-operator/pull/1719) ([andreyvelich](https://github.com/andreyvelich)) +## Bug Fixes -**Bug fixes** -- [SDK] pod has no metadata attr anymore in the get\_job\_logs\(\) … [\#1760](https://github.com/kubeflow/training-operator/pull/1760) ([yaobaiwei](https://github.com/yaobaiwei)) +- [SDK] pod has no metadata attr anymore in the get_job_logs\(\) … [\#1760](https://github.com/kubeflow/training-operator/pull/1760) ([yaobaiwei](https://github.com/yaobaiwei)) - Add PodGroup as controller watch source [\#1666](https://github.com/kubeflow/training-operator/pull/1666) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg)) - fix infinite loop in init-pytorch container [\#1756](https://github.com/kubeflow/training-operator/pull/1756) ([kidddddddddddddddddddddd](https://github.com/kidddddddddddddddddddddd)) - Fix the success condition of the job in PyTorchJob's Elastic mode. 
[\#1752](https://github.com/kubeflow/training-operator/pull/1752) ([Syulin7](https://github.com/Syulin7)) @@ -75,13 +189,14 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - handle all restart policies [\#1649](https://github.com/kubeflow/training-operator/pull/1649) ([abin-thomas-by](https://github.com/abin-thomas-by)) - \[chore\] fix typo [\#1648](https://github.com/kubeflow/training-operator/pull/1648) ([tenzen-y](https://github.com/tenzen-y)) -**Misc** +## Misc + - Add validation for verifying that the CustomJob \(e.g., TFJob\) name meets DNS1035 [\#1748](https://github.com/kubeflow/training-operator/pull/1748) ([tenzen-y](https://github.com/tenzen-y)) - Configure controller worker threads [\#1707](https://github.com/kubeflow/training-operator/pull/1707) ([HeGaoYuan](https://github.com/HeGaoYuan)) - Validation Spec consistency [\#1705](https://github.com/kubeflow/training-operator/pull/1705) ([HeGaoYuan](https://github.com/HeGaoYuan)) - \[SDK\] Remove Final Keyword from constants [\#1676](https://github.com/kubeflow/training-operator/pull/1676) ([andreyvelich](https://github.com/andreyvelich)) - Fix Python installation in CI [\#1759](https://github.com/kubeflow/training-operator/pull/1759) ([tenzen-y](https://github.com/tenzen-y)) -- Update mpijob\_controller.go [\#1755](https://github.com/kubeflow/training-operator/pull/1755) ([yshalabi](https://github.com/yshalabi)) +- Update mpijob_controller.go [\#1755](https://github.com/kubeflow/training-operator/pull/1755) ([yshalabi](https://github.com/yshalabi)) - Set the default value of CleanPodPolicy to None [\#1754](https://github.com/kubeflow/training-operator/pull/1754) ([Syulin7](https://github.com/Syulin7)) - Update join Slack link [\#1750](https://github.com/kubeflow/training-operator/pull/1750) ([Syulin7](https://github.com/Syulin7)) - Update latest operator image [\#1742](https://github.com/kubeflow/training-operator/pull/1742) 
([johnugeorge](https://github.com/johnugeorge)) @@ -94,13 +209,13 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - docs: Update Kubernetes requirement and version matrix [\#1721](https://github.com/kubeflow/training-operator/pull/1721) ([terrytangyuan](https://github.com/terrytangyuan)) - chore: Update the use of MultiWorkerMirroredStrategy in TF [\#1715](https://github.com/kubeflow/training-operator/pull/1715) ([terrytangyuan](https://github.com/terrytangyuan)) - Removing deprecated Job Labels [\#1702](https://github.com/kubeflow/training-operator/pull/1702) ([johnugeorge](https://github.com/johnugeorge)) -- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf\_operator [\#1699](https://github.com/kubeflow/training-operator/pull/1699) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Bump certifi from 2022.9.14 to 2022.12.7 in /py/kubeflow/tf_operator [\#1699](https://github.com/kubeflow/training-operator/pull/1699) ([dependabot[bot]](https://github.com/apps/dependabot)) - Add myself to reviewer. 
[\#1689](https://github.com/kubeflow/training-operator/pull/1689) ([kuizhiqing](https://github.com/kuizhiqing)) - Upgrade the envtest version [\#1687](https://github.com/kubeflow/training-operator/pull/1687) ([tenzen-y](https://github.com/tenzen-y)) - \[chore\] Upgrade some actions version [\#1686](https://github.com/kubeflow/training-operator/pull/1686) ([tenzen-y](https://github.com/tenzen-y)) - Upgrade Golangci-lint [\#1685](https://github.com/kubeflow/training-operator/pull/1685) ([johnugeorge](https://github.com/johnugeorge)) - Make a generic logger instead of the nil logger on dependent update [\#1680](https://github.com/kubeflow/training-operator/pull/1680) ([ggaaooppeenngg](https://github.com/ggaaooppeenngg)) -- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf\_operator [\#1669](https://github.com/kubeflow/training-operator/pull/1669) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Bump protobuf from 3.8.0 to 3.18.3 in /py/kubeflow/tf_operator [\#1669](https://github.com/kubeflow/training-operator/pull/1669) ([dependabot[bot]](https://github.com/apps/dependabot)) - Removed GOARCH dependency for multiarch support [\#1674](https://github.com/kubeflow/training-operator/pull/1674) ([pranavpandit1](https://github.com/pranavpandit1)) - Update deployment.yaml [\#1668](https://github.com/kubeflow/training-operator/pull/1668) ([OmriShiv](https://github.com/OmriShiv)) - Upgrade Go version to v1.19 [\#1663](https://github.com/kubeflow/training-operator/pull/1663) ([tenzen-y](https://github.com/tenzen-y)) @@ -111,18 +226,18 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - Add finalizers to cluster-role [\#1646](https://github.com/kubeflow/training-operator/pull/1646) ([ArangoGutierrez](https://github.com/ArangoGutierrez)) - Update the cmd to support MPI operator in ReadME [\#1656](https://github.com/kubeflow/training-operator/pull/1656) ([denkensk](https://github.com/denkensk)) -**Closed issues:** +## Closed issues - 
The default value for CleanPodPolicy is inconsistent. [\#1753](https://github.com/kubeflow/training-operator/issues/1753) -- HPA support for PyTorch Elastic [\#1751](https://github.com/kubeflow/training-operator/issues/1751) +- HPA support for PyTorch Elastic [\#1751](https://github.com/kubeflow/training-operator/issues/1751) - Bug: allowance of non DNS-1035 compliant PyTorchJob names results in service creation failures and missing state [\#1745](https://github.com/kubeflow/training-operator/issues/1745) -- paddle-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\#1729](https://github.com/kubeflow/training-operator/issues/1729) +- paddle-operator can not get podgroup status\(inqueue\) with volcano when enable gang [\#1729](https://github.com/kubeflow/training-operator/issues/1729) - \*job API\(master\) cannot compatible with old job [\#1725](https://github.com/kubeflow/training-operator/issues/1725) - Support coscheduling plugin [\#1722](https://github.com/kubeflow/training-operator/issues/1722) - Number of worker threads used by the controller can't be configured [\#1706](https://github.com/kubeflow/training-operator/issues/1706) - Conformance: Training tests [\#1698](https://github.com/kubeflow/training-operator/issues/1698) - PyTorch and MPI Operator pulls hardcoded initContainer [\#1696](https://github.com/kubeflow/training-operator/issues/1696) -- PaddlePaddle Training: why can't find pods [\#1694](https://github.com/kubeflow/training-operator/issues/1694) +- PaddlePaddle Training: why can't find pods [\#1694](https://github.com/kubeflow/training-operator/issues/1694) - Training-operator pod CrashLoopBackOff in K8s v1.23.6 with kubeflow1.6.1 [\#1693](https://github.com/kubeflow/training-operator/issues/1693) - \[SDK\] Create unify client for all Training Job types [\#1691](https://github.com/kubeflow/training-operator/issues/1691) - Support Kubernetes v1.25 [\#1682](https://github.com/kubeflow/training-operator/issues/1682) @@ 
-145,24 +260,24 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - The pytorchJob training is slow [\#1532](https://github.com/kubeflow/training-operator/issues/1532) - pytorch elastic scheduler error [\#1504](https://github.com/kubeflow/training-operator/issues/1504) -## [v1.4.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0) (2022-01-26) +# [v1.4.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.4.0-rc.0) (2022-01-26) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0...v1.4.0-rc.0) -**Features and improvements:** +## Features and Improvements - Display coverage % in GitHub actions list [\#1442](https://github.com/kubeflow/training-operator/issues/1442) - Add Go test to CI [\#1436](https://github.com/kubeflow/training-operator/issues/1436) -**Fixed bugs:** +## Fixed Bugs - \[bug\] Missing init container in PyTorchJob [\#1482](https://github.com/kubeflow/training-operator/issues/1482) - Fail to install tf-operator in minikube because of the version of kubectl/kustomize [\#1381](https://github.com/kubeflow/training-operator/issues/1381) -**Closed issues:** +## Closed Issues -- Restore KUBEFLOW\_NAMESPACE options [\#1522](https://github.com/kubeflow/training-operator/issues/1522) -- Improve test coverage [\#1497](https://github.com/kubeflow/training-operator/issues/1497) +- Restore KUBEFLOW_NAMESPACE options [\#1522](https://github.com/kubeflow/training-operator/issues/1522) +- Improve test coverage [\#1497](https://github.com/kubeflow/training-operator/issues/1497) - swagger.json missing Pytorchjob.Spec.ElasticPolicy [\#1483](https://github.com/kubeflow/training-operator/issues/1483) - PytorchJob DDP training will stop if I delete a worker pod [\#1478](https://github.com/kubeflow/training-operator/issues/1478) - Write down e2e failure debug process [\#1467](https://github.com/kubeflow/training-operator/issues/1467) @@ -171,9 +286,9 @@ Note: Latest [Python SDK 1.6 
version](https://pypi.org/project/kubeflow-training - Podgroup is constantly created and deleted after tfjob is success or failure [\#1426](https://github.com/kubeflow/training-operator/issues/1426) - Cut official release of 1.3.0 [\#1425](https://github.com/kubeflow/training-operator/issues/1425) - Add "not maintained" notice to other operator repos [\#1423](https://github.com/kubeflow/training-operator/issues/1423) -- Python SDK for Kubeflow Training Operator [\#1380](https://github.com/kubeflow/training-operator/issues/1380) +- Python SDK for Kubeflow Training Operator [\#1380](https://github.com/kubeflow/training-operator/issues/1380) -**Merged pull requests:** +## Merged Pull Requests - Update manifests with latest image tag [\#1527](https://github.com/kubeflow/training-operator/pull/1527) ([johnugeorge](https://github.com/johnugeorge)) - add option for mpi kubectl delivery [\#1525](https://github.com/kubeflow/training-operator/pull/1525) ([zw0610](https://github.com/zw0610)) @@ -189,7 +304,7 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - chore: Fix GitHub Actions script [\#1491](https://github.com/kubeflow/training-operator/pull/1491) ([tenzen-y](https://github.com/tenzen-y)) - chore: Fix missspell in tfjob [\#1490](https://github.com/kubeflow/training-operator/pull/1490) ([tenzen-y](https://github.com/tenzen-y)) - chore: Update OWNERS [\#1489](https://github.com/kubeflow/training-operator/pull/1489) ([gaocegege](https://github.com/gaocegege)) -- Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf\_operator [\#1487](https://github.com/kubeflow/training-operator/pull/1487) ([dependabot[bot]](https://github.com/apps/dependabot)) +- Bump jinja2 from 2.10.1 to 2.11.3 in /py/kubeflow/tf_operator [\#1487](https://github.com/kubeflow/training-operator/pull/1487) ([dependabot[bot]](https://github.com/apps/dependabot)) - fix comments for mpi-controller [\#1485](https://github.com/kubeflow/training-operator/pull/1485) 
([hackerboy01](https://github.com/hackerboy01)) - add expectation-related functions for other resources used in mpi-controller [\#1484](https://github.com/kubeflow/training-operator/pull/1484) ([zw0610](https://github.com/zw0610)) - Add MPI job to README now that it's supported [\#1480](https://github.com/kubeflow/training-operator/pull/1480) ([terrytangyuan](https://github.com/terrytangyuan)) @@ -224,149 +339,145 @@ Note: Latest [Python SDK 1.6 version](https://pypi.org/project/kubeflow-training - Bump controller-tools to 0.6.0 and enable GenerateEmbeddedObjectMeta [\#1409](https://github.com/kubeflow/training-operator/pull/1409) ([Jeffwan](https://github.com/Jeffwan)) - Update scripts to generate sdk for all frameworks [\#1389](https://github.com/kubeflow/training-operator/pull/1389) ([Jeffwan](https://github.com/Jeffwan)) -## [v1.3.0](https://github.com/kubeflow/training-operator/tree/v1.3.0) (2021-10-03) +# [v1.3.0](https://github.com/kubeflow/training-operator/tree/v1.3.0) (2021-10-03) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-rc.2...v1.3.0) -**Fixed bugs:** +## Fixed Bugs - Unable to specify pod template metadata for TFJob [\#1403](https://github.com/kubeflow/training-operator/issues/1403) -## [v1.3.0-rc.2](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.2) (2021-09-21) +# [v1.3.0-rc.2](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.2) (2021-09-21) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-rc.1...v1.3.0-rc.2) -**Fixed bugs:** +## Fixed Bugs - Missing Pod label for Service selector [\#1399](https://github.com/kubeflow/training-operator/issues/1399) -## [v1.3.0-rc.1](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.1) (2021-09-15) +# [v1.3.0-rc.1](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.1) (2021-09-15) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-rc.0...v1.3.0-rc.1) -**Fixed bugs:** +## Fixed Bugs - 
\[bug\] Reconcilation fails when upgrading common to 0.3.6 [\#1394](https://github.com/kubeflow/training-operator/issues/1394) -**Merged pull requests:** +## Merged Pull Requests -- Update manifests with latest image tag [\#1406](https://github.com/kubeflow/training-operator/pull/1406) ([johnugeorge](https://github.com/johnugeorge)) -- 2010: fix to expose correct monitoring port [\#1405](https://github.com/kubeflow/training-operator/pull/1405) ([deepak-muley](https://github.com/deepak-muley)) +- Update manifests with latest image tag [\#1406](https://github.com/kubeflow/training-operator/pull/1406) ([johnugeorge](https://github.com/johnugeorge)) +- 2010: fix to expose correct monitoring port [\#1405](https://github.com/kubeflow/training-operator/pull/1405) ([deepak-muley](https://github.com/deepak-muley)) - Fix 1399: added pod matching label in service selector [\#1404](https://github.com/kubeflow/training-operator/pull/1404) ([deepak-muley](https://github.com/deepak-muley)) - fix: runPolicy validation error in the examples [\#1401](https://github.com/kubeflow/training-operator/pull/1401) ([Jeffwan](https://github.com/Jeffwan)) -## [v1.3.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.0) (2021-08-31) +# [v1.3.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.3.0-rc.0) (2021-08-31) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-alpha.3...v1.3.0-rc.0) -**Merged pull requests:** +## Merged Pull Requests - chore: Update training-operator tag [\#1396](https://github.com/kubeflow/training-operator/pull/1396) ([Jeffwan](https://github.com/Jeffwan)) - Add simple verification jobs [\#1391](https://github.com/kubeflow/training-operator/pull/1391) ([Jeffwan](https://github.com/Jeffwan)) - fix: volcano pod group creation issue [\#1390](https://github.com/kubeflow/training-operator/pull/1390) ([Jeffwan](https://github.com/Jeffwan)) - chore: Bump kubeflow/common version to 0.3.7 
[\#1388](https://github.com/kubeflow/training-operator/pull/1388) ([Jeffwan](https://github.com/Jeffwan)) -## [v1.3.0-alpha.3](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.3) (2021-08-29) +# [v1.3.0-alpha.3](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.3) (2021-08-29) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.2.1...v1.3.0-alpha.3) -**Closed issues:** +## Closed Issues -- Update guidance to install all-in-one operator in README.md [\#1386](https://github.com/kubeflow/training-operator/issues/1386) +- Update guidance to install all-in-one operator in README.md [\#1386](https://github.com/kubeflow/training-operator/issues/1386) -**Merged pull requests:** +## Merged Pull Requests - chore\(doc\): Update README.md [\#1387](https://github.com/kubeflow/training-operator/pull/1387) ([Jeffwan](https://github.com/Jeffwan)) - Remove tf-operator from the codebase [\#1378](https://github.com/kubeflow/training-operator/pull/1378) ([thunderboltsid](https://github.com/thunderboltsid)) -## [v1.2.1](https://github.com/kubeflow/training-operator/tree/v1.2.1) (2021-08-27) +# [v1.2.1](https://github.com/kubeflow/training-operator/tree/v1.2.1) (2021-08-27) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-alpha.2...v1.2.1) -## [v1.3.0-alpha.2](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.2) (2021-08-15) +# [v1.3.0-alpha.2](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.2) (2021-08-15) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.3.0-alpha.1...v1.3.0-alpha.2) -## [v1.3.0-alpha.1](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.1) (2021-08-13) +# [v1.3.0-alpha.1](https://github.com/kubeflow/training-operator/tree/v1.3.0-alpha.1) (2021-08-13) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.2.0...v1.3.0-alpha.1) -## [v1.2.0](https://github.com/kubeflow/training-operator/tree/v1.2.0) 
(2021-08-03) +# [v1.2.0](https://github.com/kubeflow/training-operator/tree/v1.2.0) (2021-08-03) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.1.0...v1.2.0) -## [v1.1.0](https://github.com/kubeflow/training-operator/tree/v1.1.0) (2021-03-20) +# [v1.1.0](https://github.com/kubeflow/training-operator/tree/v1.1.0) (2021-03-20) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.5...v1.1.0) -## [v1.0.1-rc.5](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.5) (2021-02-09) +# [v1.0.1-rc.5](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.5) (2021-02-09) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.4...v1.0.1-rc.5) -## [v1.0.1-rc.4](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.4) (2021-02-04) +# [v1.0.1-rc.4](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.4) (2021-02-04) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.3...v1.0.1-rc.4) -## [v1.0.1-rc.3](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.3) (2021-01-27) +# [v1.0.1-rc.3](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.3) (2021-01-27) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.2...v1.0.1-rc.3) -## [v1.0.1-rc.2](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.2) (2021-01-27) +# [v1.0.1-rc.2](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.2) (2021-01-27) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.1...v1.0.1-rc.2) -## [v1.0.1-rc.1](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.1) (2021-01-18) +# [v1.0.1-rc.1](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.1) (2021-01-18) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.1-rc.0...v1.0.1-rc.1) -## [v1.0.1-rc.0](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.0) (2020-12-22) +# 
[v1.0.1-rc.0](https://github.com/kubeflow/training-operator/tree/v1.0.1-rc.0) (2020-12-22) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v1.0.0-rc.0...v1.0.1-rc.0) -## [v1.0.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.0.0-rc.0) (2019-06-24) +# [v1.0.0-rc.0](https://github.com/kubeflow/training-operator/tree/v1.0.0-rc.0) (2019-06-24) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.5.3...v1.0.0-rc.0) -## [v0.5.3](https://github.com/kubeflow/training-operator/tree/v0.5.3) (2019-06-03) +# [v0.5.3](https://github.com/kubeflow/training-operator/tree/v0.5.3) (2019-06-03) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.5.2...v0.5.3) -## [v0.5.2](https://github.com/kubeflow/training-operator/tree/v0.5.2) (2019-05-23) +# [v0.5.2](https://github.com/kubeflow/training-operator/tree/v0.5.2) (2019-05-23) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.5.1...v0.5.2) -## [v0.5.1](https://github.com/kubeflow/training-operator/tree/v0.5.1) (2019-05-15) +# [v0.5.1](https://github.com/kubeflow/training-operator/tree/v0.5.1) (2019-05-15) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.5.0...v0.5.1) -## [v0.5.0](https://github.com/kubeflow/training-operator/tree/v0.5.0) (2019-03-26) +# [v0.5.0](https://github.com/kubeflow/training-operator/tree/v0.5.0) (2019-03-26) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.4.0...v0.5.0) -## [v0.4.0](https://github.com/kubeflow/training-operator/tree/v0.4.0) (2019-02-13) +# [v0.4.0](https://github.com/kubeflow/training-operator/tree/v0.4.0) (2019-02-13) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.4.0-rc.1...v0.4.0) -## [v0.4.0-rc.1](https://github.com/kubeflow/training-operator/tree/v0.4.0-rc.1) (2018-11-28) +# [v0.4.0-rc.1](https://github.com/kubeflow/training-operator/tree/v0.4.0-rc.1) (2018-11-28) [Full 
Changelog](https://github.com/kubeflow/training-operator/compare/v0.4.0-rc.0...v0.4.0-rc.1) -## [v0.4.0-rc.0](https://github.com/kubeflow/training-operator/tree/v0.4.0-rc.0) (2018-11-19) +# [v0.4.0-rc.0](https://github.com/kubeflow/training-operator/tree/v0.4.0-rc.0) (2018-11-19) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.3.0...v0.4.0-rc.0) -## [v0.3.0](https://github.com/kubeflow/training-operator/tree/v0.3.0) (2018-09-22) +# [v0.3.0](https://github.com/kubeflow/training-operator/tree/v0.3.0) (2018-09-22) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.2.0-rc1...v0.3.0) -## [v0.2.0-rc1](https://github.com/kubeflow/training-operator/tree/v0.2.0-rc1) (2018-06-21) +# [v0.2.0-rc1](https://github.com/kubeflow/training-operator/tree/v0.2.0-rc1) (2018-06-21) [Full Changelog](https://github.com/kubeflow/training-operator/compare/v0.1.0...v0.2.0-rc1) -## [v0.1.0](https://github.com/kubeflow/training-operator/tree/v0.1.0) (2018-03-29) +# [v0.1.0](https://github.com/kubeflow/training-operator/tree/v0.1.0) (2018-03-29) [Full Changelog](https://github.com/kubeflow/training-operator/compare/5b1ff9c7058c2af718ed8d399aebcfd124217f8c...v0.1.0) - - - -\* *This Changelog was automatically generated by [github_changelog_generator](https://github.com/github-changelog-generator/github-changelog-generator)* diff --git a/docs/release/README.md b/docs/release/README.md new file mode 100644 index 0000000000..f332bbe2ef --- /dev/null +++ b/docs/release/README.md @@ -0,0 +1,59 @@ +# Releasing the Training Operator + +## Prerequisite + +1. Permissions + - You need write permissions on the repository to create a release tag/branch. +1. Prepare your Github Token + +1. Install Github python dependencies to generate changlog + ``` + pip install PyGithub==1.55 + ``` + +### Release Process + +1. Make sure the last commit you want to release past `kubeflow-training-operator-postsubmit` testing. + +1. 
Check out that commit (in this example, we'll use `6214e560`). + +1. Depends on what version you want to release, + + - Major or Minor version - Use the GitHub UI to cut a release branch and name the release branch `v{MAJOR}.${MINOR}-branch` + - Patch version - You don't need to cut release branch. + +1. Create a new PR against the release branch to change container image in manifest to point to that commit hash. + + ``` + images: + - name: kubeflow/training-operator + newName: kubeflow/training-operator + newTag: ${commit_hash} + ``` + + > note: post submit job will always build a new image using the `PULL_BASE_HASH` as image tag. + +1. Create a tag and push tag to upstream. + + ``` + git tag v1.2.0 + git push upstream v1.2.0 + ``` + +1. Update the Changelog by running: + + ``` + python docs/release/changelog.py --token= --range=.. + ``` + + If you are creating the **first minor pre-release** or the **minor** release (`X.Y`), your + `previous-release` is equal to the latest release on the `vX.Y-branch` branch. + For example: `--range=v1.7.1..v1.8.0` + + Otherwise, your `previous-release` is equal to the latest release on the `vX.Y-branch` branch. + For example: `--range=v1.7.0..v1.8.0-rc.0` + + Group PRs in the Changelog into Features, Bug fixes, Documentation, etc. + Check this example: [v1.7.0-rc.0](https://github.com/kubeflow/training-operator/blob/master/CHANGELOG.md#v170-rc0-2023-07-07) + + Finally, submit a PR with the updated Changelog. 
diff --git a/docs/release/changelog.py b/docs/release/changelog.py new file mode 100644 index 0000000000..ac508d025f --- /dev/null +++ b/docs/release/changelog.py @@ -0,0 +1,71 @@ +from github import Github +import argparse + +REPO_NAME = "kubeflow/training-operator" +CHANGELOG_FILE = "CHANGELOG.md" + +parser = argparse.ArgumentParser() +parser.add_argument("--token", type=str, help="GitHub Access Token") +parser.add_argument( + "--range", type=str, help="Changelog is generated for this release range" +) +args = parser.parse_args() + +if args.token is None: + raise Exception("GitHub Token must be set") +try: + previous_release = args.range.split("..")[0] + current_release = args.range.split("..")[1] +except Exception: + raise Exception("Release range must be set in this format: v1.7.0..v1.8.0") + +# Get list of commits from the range. +github_repo = Github(args.token).get_repo(REPO_NAME) +comparison = github_repo.compare(previous_release, current_release) +commits = comparison.commits + +# The latest commit contains the release date. +release_date = str(commits[-1].commit.author.date).split(" ")[0] +release_url = "https://github.com/{}/tree/{}".format(REPO_NAME, current_release) + +# Get all PRs in reverse chronological order from the commits. +pr_list = "" +pr_set = set() +for commit in reversed(commits): + # Only add commits with PRs. + for pr in commit.get_pulls(): + # Each PR is added only one time to the list. + if pr.number in pr_set: + continue + pr_set.add(pr.number) + + new_pr = "- {title} ([#{id}]({pr_link}) by [@{user_id}]({user_url}))\n".format( + title=pr.title, + id=pr.number, + pr_link=pr.html_url, + user_id=pr.user.login, + user_url=pr.user.html_url, + ) + pr_list += new_pr + +change_log = [ + "# Changelog" "\n\n", + "# [{}]({}) ({})".format(current_release, release_url, release_date), + "\n\n", + "## TODO: Group PRs into Breaking Changes, New Features, Bug fixes, Misc, etc. 
" + + "For example: [v1.7.0](https://github.com/kubeflow/training-operator/releases/tag/v1.7.0)", + "\n\n", + pr_list, + "\n" "[Full Changelog]({})\n".format(comparison.html_url), +] + +# Update Changelog with the new changes. +with open(CHANGELOG_FILE, "r+") as f: + lines = f.readlines() + f.seek(0) + lines = lines[0:0] + change_log + lines[1:] + f.writelines(lines) + +print("Changelog has been updated\n") +print("Group PRs in the Changelog into Features, Bug fixes, Misc, etc.\n") +print("After that, submit a PR with the updated Changelog") diff --git a/docs/release/release.py b/docs/release/release.py deleted file mode 100644 index a463220102..0000000000 --- a/docs/release/release.py +++ /dev/null @@ -1,43 +0,0 @@ -from github import Github -import re - - -class ChangelogGenerator: - def __init__(self, github_repo): - # Replace with your Github Token - self._github = Github('') - self._github_repo = self._github.get_repo(github_repo) - - def generate(self, pr_id): - pr = self._github_repo.get_pull(pr_id) - - return "{title} ([#{pr_id}]({pr_link}), @{user})".format( - title=pr.title, - pr_id=pr_id, - pr_link=pr.html_url, - user=pr.user.login - ) - - -# generated by `git log ..HEAD --oneline` -payload = ''' -6f1e96c4 Update container image for v1.1.1 (#1328) -47a74b73 add a specific version of tensorflow_datasets (#1305) -e3061132 Remove vendor folder (#1288) -eb362bd8 Fix invalid pointer when tfjob is deleted (#1285) -0c41b273 fix get_logs pod_names type and iteration blocking (#1280) -af5bdd58 Add job namespace to `tf_operator_jobs_*` counters (#1283) -6fd9489e fix custom_api.delete_namespaced_custom_object args (#1281) -c095f7a9 feat: upgrade kubeflow common and volcano version (#1276) -13b17b0e Use remote Kustomize build option in standalone installation instructions (#1266) -faf34868 fix: Remove the dup comment tag (#1274) -9a297876 add podgroups rule in cluster-role.yaml (#1272) -58c9bc4a Fix: the "follow" of TFJobClient.get_logs (#1254) -3d9e7c8a Add task 
type annotation for pods when EnableGangScheduling is true. (#1268) -8d179f70 Fix: Remove Github CD workflow (#1263) -''' - -g = ChangelogGenerator("kubeflow/training-operator") -for pr_match in re.finditer(r"#(\d+)", payload): - pr_id = int(pr_match.group(1)) - print("* {}".format(g.generate(pr_id))) diff --git a/docs/release/releasing.md b/docs/release/releasing.md deleted file mode 100644 index cd55d24aae..0000000000 --- a/docs/release/releasing.md +++ /dev/null @@ -1,103 +0,0 @@ -# Releasing the training operator - -## Prerequisite - -1. Permissions - - You need to be a member of release-team@kubeflow.org. - - You need write permissions on the repository to create a release tag/branch. - -2. Prepare your Github Token - -3. Install Github python dependencies to generate changlog - ``` - pip install PyGithub - ``` - -### Release Process - -1. Make sure the last commit you want to release past `kubeflow-training-operator-postsubmit` testing. - -1. Check out that commit (in this example, we'll use `6214e560`). - -1. Depends on what version you want to release, - - Major or Minor version - Use the GitHub UI to cut a release branch and name the release branch `v{MAJOR}.${MINOR}-branch` - - Patch version - You don't need to cut release branch. - -1. Create a new PR against the release branch to change container image in manifest to point to that commit hash. - - ``` - images: - - name: kubeflow/training-operator - newName: kubeflow/training-operator - newTag: ${commit_hash} - ``` - - > note: post submit job will always build a new image using the `PULL_BASE_HASH` as image tag. - -1. Create a tag and push tag to upstream. - - ``` - git tag v1.2.0 - git push upstream v1.2.0 - ``` - -1. Run following code and fetch online git commits from last release (v1.1.0) to current release (v1.2.0). - - ``` - git log v1.1.0..v1.2.0 --oneline - ``` - -1. Copy above commit history to `release.py` and replace `` with your Github token. - Run this python scripts to generate changelogs. 
- - ``` - from github import Github - import re - - - class ChangelogGenerator: - def __init__(self, github_repo): - # Replace with your Github Token - self._github = Github('') - self._github_repo = self._github.get_repo(github_repo) - - def generate(self, pr_id): - pr = self._github_repo.get_pull(pr_id) - - return "{title} ([#{pr_id}]({pr_link}), @{user})".format( - title=pr.title, - pr_id=pr_id, - pr_link=pr.html_url, - user=pr.user.login - ) - - - # generated by `git log .. --oneline` - payload = ''' - 6f1e96c4 Update container image for v1.2.0 (#1328) - 47a74b73 add a specific version of tensorflow_datasets (#1305) - e3061132 Remove vendor folder (#1288) - eb362bd8 Fix invalid pointer when tfjob is deleted (#1285) - 0c41b273 fix get_logs pod_names type and iteration blocking (#1280) - af5bdd58 Add job namespace to `tf_operator_jobs_*` counters (#1283) - 6fd9489e fix custom_api.delete_namespaced_custom_object args (#1281) - c095f7a9 feat: upgrade kubeflow common and volcano version (#1276) - 13b17b0e Use remote Kustomize build option in standalone installation instructions (#1266) - faf34868 fix: Remove the dup comment tag (#1274) - 9a297876 add podgroups rule in cluster-role.yaml (#1272) - 58c9bc4a Fix: the "follow" of TFJobClient.get_logs (#1254) - 3d9e7c8a Add task type annotation for pods when EnableGangScheduling is true. (#1268) - 8d179f70 Fix: Remove Github CD workflow (#1263) - ''' - - g = ChangelogGenerator("kubeflow/training-operator") - for pr_match in re.finditer(r"#(\d+)", payload): - pr_id = int(pr_match.group(1)) - print("* {}".format(g.generate(pr_id))) - ``` - -1. Cut release from tags and copy results from last step. You can group commits into `Features`, `Bugs` etc. -See example [v1.2.0 release](https://github.com/kubeflow/training-operator/releases/tag/v1.2.0) - -1. Send a PR to update [CHANGELOG.md](../../CHANGELOG.md) - \ No newline at end of file
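
Review note on `docs/release/changelog.py`: the range parsing and the PR entry formatting can be exercised without a GitHub token if they are pulled into small functions. The sketch below is only an illustration of that idea, not code from this PR. The function names `parse_release_range` and `format_pr_entry` are hypothetical, and the validation is slightly stricter than the script's (it rejects a range with more than one `..` separator, which the script would silently truncate).

```python
# Hypothetical helpers (not part of this PR) mirroring two steps of changelog.py.


def parse_release_range(release_range):
    """Split a range such as 'v1.7.0..v1.8.0' into (previous, current)."""
    # Treat a missing --range value like an empty string so it fails the check below.
    parts = (release_range or "").split("..")
    if len(parts) != 2 or not all(parts):
        raise ValueError("Release range must be set in this format: v1.7.0..v1.8.0")
    return parts[0], parts[1]


def format_pr_entry(title, number, pr_url, login, user_url):
    """Build one changelog bullet using the same template as changelog.py."""
    return "- {title} ([#{id}]({pr_link}) by [@{user_id}]({user_url}))\n".format(
        title=title,
        id=number,
        pr_link=pr_url,
        user_id=login,
        user_url=user_url,
    )


if __name__ == "__main__":
    previous_release, current_release = parse_release_range("v1.7.0..v1.8.0-rc.0")
    print(previous_release, current_release)
    print(
        format_pr_entry(
            "Support K8s v1.29 and Drop K8s v1.26",
            2039,
            "https://github.com/kubeflow/training-operator/pull/2039",
            "tenzen-y",
            "https://github.com/tenzen-y",
        ),
        end="",
    )
```

With helpers like these, the script body would only need to fetch commits and PRs through PyGithub and pass the fields along, which keeps the string handling checkable offline.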