Add Conformance Program Doc for AutoML and Training WG

kubeflow · Dec 1, 2022 · 3a704a0
1 parent bd91301

Showing 5 changed files with 225 additions and 12 deletions.
diff --git a/docs/proposals/conformance-test.md b/docs/proposals/conformance-test.md
@@ -0,0 +1,140 @@
+# Conformance Test for AutoML and Training Working Group
+
+Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
+Johnu George ([@johnugeorge](https://github.com/johnugeorge))
+2022-11-21
+[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).
+
+## Motivation
+
+The Kubeflow community needs to design a conformance program so that distributions can
+become
+[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
+Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
+[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
+We should design a similar program for the AutoML and Training WG.
+
+This document is based on the original proposal for
+[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).
+
+## Objective
+
+The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:
+
+- The tests should be fully automated and executable by anyone who has public
+  access to the Kubeflow repository.
+- The test results should be easy to verify by the Kubeflow Conformance Committee.
+- The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
+- The tests should cover the basic functionality of Katib and the Training Operator.
+  They will not cover all features.
+- The tests are expected to evolve in future versions.
+
+## Kubeflow Conformance
+
+Kubeflow conformance consists of three categories of tests:
+
+- API-based tests
+
+  Currently, neither Katib nor the Training Operator has an API server that receives
+  requests from users. However, Katib has the DB Manager component, which is
+  responsible for writing and reading ML training metrics.
+
+  In future versions, we should design a conformance program for the
+  Katib API-based tests.
+
+- CRD-based tests
+
+  Most Katib and Training Operator functionality is based on Kubernetes CRDs.
+
+  **This document will define a design for CRD-based tests for Katib and the Training Operator.**
+
+- UI-based tests
+
+  In future versions, we should design a conformance program for the
+  Katib UI-based tests.
+
+## Design for the CRD-based tests
+
+![conformance-crd-test](../images/conformance-crd-test.png)
+
+The design is similar to the KFP conformance program for the API-based tests.
+
+For Katib, tests will be based on
+[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
+that we run for our e2e tests.
+
+This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.
+
+For the Training Operator, tests will be based on [the SDK e2e tests](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e).
+
+### Test Workflow
+
+All tests will run in the _kf-conformance_ namespace inside a separate container.
+This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.
+
+- We are going to use
+  [the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
+  for all Kubeflow conformance tests. Distributions (_driver_ on the diagram)
+  need to run the following Makefile commands:
+
+  ```makefile
+
+  # Run the conformance program.
+  run: setup run-katib run-training-operator
+
+  # Sets up the Kubernetes resources (Kubeflow Profile, RBAC) that are needed to run the tests.
+  # Create temporary folder for the conformance report.
+  setup:
+    kubectl apply -f ./setup.yaml
+    mkdir -p /tmp/kf-conformance
+
+  # Create deployment and run the e2e tests for Katib and Training Operator.
+  run-katib:
+    kubectl apply -f ./katib-conformance.yaml
+
+  run-training-operator:
+    kubectl apply -f ./training-operator-conformance.yaml
+
+  # Download the test deployment results to create PR for the Kubeflow Conformance Committee.
+  report:
+    ./report-conformance.sh
+
+  # Cleans up created resources and directories.
+  cleanup:
+    kubectl delete -f ./setup.yaml
+    kubectl delete -f ./katib-conformance.yaml
+    kubectl delete -f ./training-operator-conformance.yaml
+    rm -rf /tmp/kf-conformance
+  ```
+
+- The Katib and Training Operator conformance deployments will have the appropriate
+  RBAC to create, read, and delete Katib Experiments and Training Operator Jobs in the
+  _kf-conformance_ namespace.
+
+- The distribution should have internet access to download the training datasets
+  (e.g. MNIST) while running the tests.
+
+- When the job is finished, the script generates output.
+
+  For a Katib Experiment, the output should be as follows:
+
+  ```
+  Test 1 - passed.
+  Experiment name: random-search
+  Experiment status: Experiment has succeeded because max trial count has reached
+  ```
+
+  For the Training Operator, the output should be as follows:
+
+  ```
+  Test 1 - passed.
+  TFJob name: tfjob-mnist
+  TFJob status: TFJob tfjob-mnist is successfully completed.
+  ```
+
+- The above report can be downloaded from the test deployment by running `make report`.
+
+- When all reports have been collected, the distribution creates a PR
+  to publish them. The Kubeflow Conformance Committee will verify the reports and
+  make the distribution
+  [Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
diff --git a/pkg/controller.v1beta1/consts/const.go b/pkg/controller.v1beta1/consts/const.go
@@ -68,6 +68,8 @@ const (
 	LabelExperimentName = "katib.kubeflow.org/experiment"
 	// LabelSuggestionName is the label of suggestion name.
 	LabelSuggestionName = "katib.kubeflow.org/suggestion"
+	// LabelTrialName is the label of trial name.
+	LabelTrialName = "katib.kubeflow.org/trial"
 	// LabelDeploymentName is the label of deployment name.
 	LabelDeploymentName = "katib.kubeflow.org/deployment"
 

diff --git a/pkg/webhook/v1beta1/pod/inject_webhook.go b/pkg/webhook/v1beta1/pod/inject_webhook.go
@@ -94,7 +94,7 @@ func (s *SidecarInjector) Handle(ctx context.Context, req admission.Request) adm
 	// Do mutation
 	mutatedPod, err := s.Mutate(pod, namespace)
 	if err != nil {
-		log.Error(err, "Failed to inject metrics collector")
+		log.Error(err, "Failed to mutate Trial's pod")
 		return admission.Errored(http.StatusBadRequest, err)
 	}
 
@@ -124,17 +124,6 @@ func (s *SidecarInjector) MutationRequired(pod *v1.Pod, ns string) (bool, error)
 		return false, err
 	}
 
-	// If PrimaryPodLabel is not set we mutate all pods which are related to Trial job
-	// Otherwise mutate pod only with appropriate labels
-	if trial.Spec.PrimaryPodLabels != nil {
-		if !isPrimaryPod(pod.Labels, trial.Spec.PrimaryPodLabels) {
-			return false, nil
-		}
-	}
-
-	if trial.Spec.MetricsCollector.Collector.Kind == common.NoneCollector {
-		return false, nil
-	}
 	return true, nil
 }
 
@@ -155,6 +144,21 @@ func (s *SidecarInjector) Mutate(pod *v1.Pod, namespace string) (*v1.Pod, error)
 		return nil, err
 	}
 
+	// Add Katib Trial labels to the Pod metadata.
+	mutatePodMetadata(mutatedPod, trial)
+
+	// Do the following mutation only for the Primary pod.
+	// If PrimaryPodLabel is not set we mutate all pods which are related to Trial job.
+	// Otherwise, mutate pod only with the appropriate labels.
+	if trial.Spec.PrimaryPodLabels != nil && !isPrimaryPod(pod.Labels, trial.Spec.PrimaryPodLabels) {
+		return mutatedPod, nil
+	}
+
+	// If Metrics Collector is None, skip the mutation.
+	if trial.Spec.MetricsCollector.Collector.Kind == common.NoneCollector {
+		return mutatedPod, nil
+	}
+
 	// Create metrics sidecar container spec
 	injectContainer, err := s.getMetricsCollectorContainer(trial, pod)
 	if err != nil {

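The effect of moving the two checks from `MutationRequired` into `Mutate` is that Trial labels are now applied to every Trial pod, while the metrics collector sidecar is still injected only for primary pods with a collector configured. A minimal standalone sketch of that decision follows. The `trialSpec` type, `noneCollector` constant, `isPrimary`, and `needsSidecar` are simplified, hypothetical stand-ins for the Katib types and helpers, not the real API.

```go
package main

import "fmt"

// trialSpec is a simplified, hypothetical stand-in for the Trial spec fields
// that the webhook consults.
type trialSpec struct {
	PrimaryPodLabels map[string]string
	CollectorKind    string
}

// noneCollector stands in for common.NoneCollector.
const noneCollector = "None"

// isPrimary reports whether every primary label is present on the pod
// with the same value.
func isPrimary(podLabels, primary map[string]string) bool {
	for k, want := range primary {
		if got, ok := podLabels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// needsSidecar mirrors the two early returns added to Mutate: labels have
// already been applied at this point, and only the sidecar injection is skipped.
func needsSidecar(podLabels map[string]string, spec trialSpec) bool {
	if spec.PrimaryPodLabels != nil && !isPrimary(podLabels, spec.PrimaryPodLabels) {
		return false
	}
	if spec.CollectorKind == noneCollector {
		return false
	}
	return true
}

func main() {
	spec := trialSpec{
		PrimaryPodLabels: map[string]string{"role": "master"},
		CollectorKind:    "StdOut",
	}
	fmt.Println(needsSidecar(map[string]string{"role": "master"}, spec)) // true
	fmt.Println(needsSidecar(map[string]string{"role": "worker"}, spec)) // false
}
```
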
diff --git a/pkg/webhook/v1beta1/pod/inject_webhook_test.go b/pkg/webhook/v1beta1/pod/inject_webhook_test.go
@@ -1019,3 +1019,49 @@ func TestIsPrimaryPod(t *testing.T) {
 		}
 	}
 }
+
+func TestMutatePodMetadata(t *testing.T) {
+	mutatedPodLabels := map[string]string{
+		"custom-pod-label":    "custom-value",
+		"katib-experiment":    "katib-value",
+		consts.LabelTrialName: "test-trial",
+	}
+
+	testCases := []struct {
+		pod             *v1.Pod
+		trial           *trialsv1beta1.Trial
+		mutatedPod      *v1.Pod
+		testDescription string
+	}{
+		{
+			pod: &v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: map[string]string{
+						"custom-pod-label": "custom-value",
+					},
+				},
+			},
+			trial: &trialsv1beta1.Trial{
+				ObjectMeta: metav1.ObjectMeta{
+					Name: "test-trial",
+					Labels: map[string]string{
+						"katib-experiment": "katib-value",
+					},
+				},
+			},
+			mutatedPod: &v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: mutatedPodLabels,
+				},
+			},
+			testDescription: "Mutated Pod should contain label from the origin Pod and Trial",
+		},
+	}
+
+	for _, tc := range testCases {
+		mutatePodMetadata(tc.pod, tc.trial)
+		if !reflect.DeepEqual(tc.mutatedPod, tc.pod) {
+			t.Errorf("Case %v. Expected Pod %v, got %v", tc.testDescription, tc.mutatedPod, tc.pod)
+		}
+	}
+}
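The merge behaviour exercised by this test can be reproduced with plain maps. The sketch below assumes the same three steps as `mutatePodMetadata` (start from the pod labels, copy the Trial labels, set the Trial name label); `mergeTrialLabels` and `labelTrialName` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// labelTrialName mirrors the value of consts.LabelTrialName added in this commit.
const labelTrialName = "katib.kubeflow.org/trial"

// mergeTrialLabels copies the Trial labels onto the pod labels and then sets
// the Trial name label. On a key collision the Trial value wins, and a nil
// pod label map is replaced by a fresh one.
func mergeTrialLabels(podLabels, trialLabels map[string]string, trialName string) map[string]string {
	if podLabels == nil {
		podLabels = map[string]string{}
	}
	for k, v := range trialLabels {
		podLabels[k] = v
	}
	podLabels[labelTrialName] = trialName
	return podLabels
}

func main() {
	merged := mergeTrialLabels(
		map[string]string{"custom-pod-label": "custom-value"},
		map[string]string{"katib-experiment": "katib-value"},
		"test-trial",
	)
	fmt.Println(len(merged), merged[labelTrialName]) // prints: 3 test-trial
}
```

Two cases that the table-driven test does not cover are a pod with no labels at all and a key present on both the pod and the Trial. Both could be added as extra entries in `testCases`.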
diff --git a/pkg/webhook/v1beta1/pod/utils.go b/pkg/webhook/v1beta1/pod/utils.go
@@ -31,6 +31,7 @@ import (
 
 	common "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
 	trialsv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/trials/v1beta1"
+	"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
 	mccommon "github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common"
 )
 
@@ -260,6 +261,26 @@ func mutateMetricsCollectorVolume(pod *v1.Pod, mountPath, sidecarContainerName,
 	return nil
 }
 
+func mutatePodMetadata(pod *v1.Pod, trial *trialsv1beta1.Trial) {
+	podLabels := map[string]string{}
+
+	// Get labels from the created pod.
+	if pod.Labels != nil {
+		podLabels = pod.Labels
+	}
+
+	// Get labels from Trial.
+	for k, v := range trial.Labels {
+		podLabels[k] = v
+	}
+
+	// Add Trial name label.
+	podLabels[consts.LabelTrialName] = trial.GetName()
+
+	// Append label to the Pod metadata.
+	pod.Labels = podLabels
+}
+
 func getSidecarContainerName(cKind common.CollectorKind) string {
 	if cKind == common.StdOutCollector || cKind == common.FileCollector {
 		return mccommon.MetricLoggerCollectorContainerName