Add Conformance Program Doc for AutoML and Training WG
andreyvelich committed Dec 1, 2022
1 parent bd91301 commit 3a704a0
Showing 5 changed files with 225 additions and 12 deletions.
140 changes: 140 additions & 0 deletions docs/proposals/conformance-test.md
@@ -0,0 +1,140 @@
# Conformance Test for AutoML and Training Working Group

Andrey Velichkevich ([@andreyvelich](https://github.com/andreyvelich))
Johnu George ([@johnugeorge](https://github.com/johnugeorge))
2022-11-21
[Original Google Doc](https://docs.google.com/document/d/1TRUKUY1zCCMdgF-nJ7QtzRwifsoQop0V8UnRo-GWlpI/edit#).

## Motivation

The Kubeflow community needs to design a conformance program so that distributions can
become
[Certified Kubeflow](https://docs.google.com/document/d/1a9ufoe_6DB1eSjpE9eK5nRBoH3ItoSkbPfxRA0AjPIc/edit?resourcekey=0-IRtbQzWfw5L_geRJ7F7GWQ#).
Recently, the Kubeflow Pipelines Working Group (WG) implemented the first version of
[their conformance tests](https://github.com/kubeflow/kubeflow/issues/6485).
We should design a similar program for the AutoML and Training WG.

This document is based on the original proposal for
[the Kubeflow Pipelines conformance program](https://docs.google.com/document/d/1_til1HkVBFQ1wCgyUpWuMlKRYI4zP1YPmNxr75mzcps/edit#).

## Objective

The conformance program for the AutoML and Training WG should follow the same goals as the Pipelines program:

- The tests should be fully automated and executable by anyone who has public
access to the Kubeflow repository.
- The test results should be easy for the Kubeflow Conformance Committee to verify.
- The tests should not depend on a specific cloud provider (e.g. AWS or GCP).
- The tests should cover the basic functionality of Katib and the Training Operator.
They will not cover all features.
- The tests are expected to evolve in future versions.

## Kubeflow Conformance

Kubeflow conformance consists of three categories of tests:

- API-based tests

Currently, neither Katib nor the Training Operator has an API server that receives
requests from users. However, Katib has the DB Manager component, which is
responsible for writing and reading ML training metrics.

In future versions, we should design a conformance program for the
Katib API-based tests.

- CRD-based tests

Most Katib and Training Operator functionality is based on Kubernetes CRDs.

**This document will define a design for CRD-based tests for Katib and the Training Operator.**

- UI-based tests

In future versions, we should design a conformance program for the
Katib UI-based tests.

## Design for the CRD-based tests

![conformance-crd-test](../images/conformance-crd-test.png)

The design is similar to the KFP conformance program for the API-based tests.

For Katib, tests will be based on
[the `run-e2e-experiment.go` script](https://github.com/kubeflow/katib/blob/570a3e68fff7b963889692d54ee1577fbf65e2ef/test/e2e/v1beta1/hack/gh-actions/run-e2e-experiment.go)
that we run for our e2e tests.

This script will be converted to use the Katib SDK. Tracking issue: https://github.com/kubeflow/katib/issues/2024.

For the Training Operator, tests will be based on [the SDK e2e test.](https://github.com/kubeflow/training-operator/tree/05badc6ee8a071400efe9019d8d60fc242818589/sdk/python/test/e2e)

### Test Workflow

All tests will run in the _kf-conformance_ namespace inside a separate container.
This helps to avoid environment variance and improves fault tolerance. A driver is required to trigger the deployment and download the results.

- We are going to use
[the unified Makefile](https://github.com/kubeflow/kubeflow/blob/2fa0d3665234125aeb8cebe8fe44f0a5a50791c5/conformance/1.5/Makefile)
for all Kubeflow conformance tests. Distributions (the _driver_ on the diagram)
need to run the following Makefile commands:

```makefile
.PHONY: run setup run-katib run-training-operator report cleanup

# Run the conformance program.
run: setup run-katib run-training-operator

# Set up the Kubernetes resources (Kubeflow Profile, RBAC) that are needed to run the tests.
# Create a temporary folder for the conformance report.
setup:
	kubectl apply -f ./setup.yaml
	mkdir -p /tmp/kf-conformance

# Create the deployments and run the e2e tests for Katib and the Training Operator.
run-katib:
	kubectl apply -f ./katib-conformance.yaml

run-training-operator:
	kubectl apply -f ./training-operator-conformance.yaml

# Download the test deployment results to create a PR for the Kubeflow Conformance Committee.
report:
	./report-conformance.sh

# Clean up the created resources and directories.
# The test deployments are deleted before the setup resources they depend on.
cleanup:
	kubectl delete -f ./katib-conformance.yaml
	kubectl delete -f ./training-operator-conformance.yaml
	kubectl delete -f ./setup.yaml
	rm -rf /tmp/kf-conformance
```

- The Katib and Training Operator conformance deployments will have the appropriate
RBAC to create, read, and delete Katib Experiments and Training Operator Jobs in the
_kf-conformance_ namespace.

- Distributions should have internet access to download the training datasets
(e.g. MNIST) while running the tests.

- When the job finishes, the script generates output.

For a Katib Experiment, the output should look as follows:

```
Test 1 - passed.
Experiment name: random-search
Experiment status: Experiment has succeeded because max trial count has reached
```

For the Training Operator, the output should look as follows:

```
Test 1 - passed.
TFJob name: tfjob-mnist
TFJob status: TFJob tfjob-mnist is successfully completed.
```

- The above report can be downloaded from the test deployment by running `make report`.

- When all reports have been collected, the distribution creates a PR
to publish them. The Kubeflow Conformance Committee will verify the reports and
make the distribution
[Certified Kubeflow](https://github.com/kubeflow/community/blob/master/proposals/kubeflow-conformance-program-proposal.md#overview).
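A reviewer could check a downloaded report mechanically. The following is a minimal Go sketch, assuming the report keeps the `Test N - passed.` header format shown above. The `summarize` helper and the `- failed.` wording are hypothetical and are not part of the conformance tooling.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// summarize is a hypothetical helper, not part of the conformance tooling.
// It counts "Test N - ..." header lines and reports whether every one of
// them ends with "- passed.".
func summarize(report string) (total int, allPassed bool) {
	allPassed = true
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if !strings.HasPrefix(line, "Test ") {
			continue // name and status lines are ignored
		}
		total++
		if !strings.HasSuffix(line, "- passed.") {
			allPassed = false
		}
	}
	// An empty report is treated as a failure.
	return total, allPassed && total > 0
}

func main() {
	report := `Test 1 - passed.
TFJob name: tfjob-mnist
TFJob status: TFJob tfjob-mnist is successfully completed.`
	total, ok := summarize(report)
	fmt.Printf("tests=%d allPassed=%t\n", total, ok)
}
```

Treating an empty report as a failure is a deliberate choice in this sketch: a report with no test headers most likely means the download step did not work.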
2 changes: 2 additions & 0 deletions pkg/controller.v1beta1/consts/const.go
@@ -68,6 +68,8 @@ const (
 	LabelExperimentName = "katib.kubeflow.org/experiment"
 	// LabelSuggestionName is the label of suggestion name.
 	LabelSuggestionName = "katib.kubeflow.org/suggestion"
+	// LabelTrialName is the label of trial name.
+	LabelTrialName = "katib.kubeflow.org/trial"
 	// LabelDeploymentName is the label of deployment name.
 	LabelDeploymentName = "katib.kubeflow.org/deployment"

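One likely use of the new `katib.kubeflow.org/trial` label is selecting all pods that belong to a Trial. The sketch below builds an equality-based label selector from it. The `trialSelector` helper is hypothetical and is not part of this commit; only the label key is taken from the diff.

```go
package main

import "fmt"

// LabelTrialName mirrors the constant added to const.go in this commit.
const LabelTrialName = "katib.kubeflow.org/trial"

// trialSelector is a hypothetical helper (not part of Katib) that builds an
// equality-based label selector matching the pods of a single Trial.
func trialSelector(trialName string) string {
	return fmt.Sprintf("%s=%s", LabelTrialName, trialName)
}

func main() {
	// The resulting string can be passed to `kubectl get pods -l <selector>`.
	fmt.Println(trialSelector("test-trial"))
}
```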
28 changes: 16 additions & 12 deletions pkg/webhook/v1beta1/pod/inject_webhook.go
@@ -94,7 +94,7 @@ func (s *SidecarInjector) Handle(ctx context.Context, req admission.Request) adm
 	// Do mutation
 	mutatedPod, err := s.Mutate(pod, namespace)
 	if err != nil {
-		log.Error(err, "Failed to inject metrics collector")
+		log.Error(err, "Failed to mutate Trial's pod")
 		return admission.Errored(http.StatusBadRequest, err)
 	}

@@ -124,17 +124,6 @@ func (s *SidecarInjector) MutationRequired(pod *v1.Pod, ns string) (bool, error)
 		return false, err
 	}

-	// If PrimaryPodLabel is not set we mutate all pods which are related to Trial job
-	// Otherwise mutate pod only with appropriate labels
-	if trial.Spec.PrimaryPodLabels != nil {
-		if !isPrimaryPod(pod.Labels, trial.Spec.PrimaryPodLabels) {
-			return false, nil
-		}
-	}
-
-	if trial.Spec.MetricsCollector.Collector.Kind == common.NoneCollector {
-		return false, nil
-	}
 	return true, nil
 }

@@ -155,6 +144,21 @@ func (s *SidecarInjector) Mutate(pod *v1.Pod, namespace string) (*v1.Pod, error)
 		return nil, err
 	}

+	// Add Katib Trial labels to the Pod metadata.
+	mutatePodMetadata(mutatedPod, trial)
+
+	// Do the following mutation only for the Primary pod.
+	// If PrimaryPodLabel is not set we mutate all pods which are related to Trial job.
+	// Otherwise, mutate pod only with the appropriate labels.
+	if trial.Spec.PrimaryPodLabels != nil && !isPrimaryPod(pod.Labels, trial.Spec.PrimaryPodLabels) {
+		return mutatedPod, nil
+	}
+
+	// If Metrics Collector is None, skip the mutation.
+	if trial.Spec.MetricsCollector.Collector.Kind == common.NoneCollector {
+		return mutatedPod, nil
+	}
+
 	// Create metrics sidecar container spec
 	injectContainer, err := s.getMetricsCollectorContainer(trial, pod)
 	if err != nil {
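The effect of moving the primary-pod and collector checks from `MutationRequired` into `Mutate` is that every Trial pod now receives the Trial labels, while only the primary pod gets the metrics collector sidecar. The sketch below illustrates that ordering with simplified stand-in types. The `trial` and `pod` structs, the `isPrimary` matching rule, and the `"None"` and `"StdOut"` collector values are assumptions for illustration and are not the Katib API.

```go
package main

import "fmt"

const trialLabel = "katib.kubeflow.org/trial"

// Simplified stand-ins for the Trial and Pod types (assumptions, not Katib types).
type trial struct {
	name             string
	primaryPodLabels map[string]string
	collectorKind    string
}

type pod struct {
	labels   map[string]string
	sidecars []string
}

// isPrimary assumes a pod is primary when it carries every primary label.
func isPrimary(podLabels, primary map[string]string) bool {
	for k, v := range primary {
		if got, ok := podLabels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// mutate mirrors the order of checks in the diff: label first, then early returns.
func mutate(p pod, t trial) pod {
	if p.labels == nil {
		p.labels = map[string]string{}
	}
	p.labels[trialLabel] = t.name // every Trial pod is labelled

	if t.primaryPodLabels != nil && !isPrimary(p.labels, t.primaryPodLabels) {
		return p // non-primary pod: labels only
	}
	if t.collectorKind == "None" {
		return p // no collector configured: labels only
	}
	p.sidecars = append(p.sidecars, "metrics-collector")
	return p
}

func main() {
	t := trial{
		name:             "test-trial",
		primaryPodLabels: map[string]string{"role": "master"},
		collectorKind:    "StdOut",
	}
	worker := mutate(pod{labels: map[string]string{"role": "worker"}}, t)
	master := mutate(pod{labels: map[string]string{"role": "master"}}, t)
	fmt.Println(worker.labels[trialLabel], len(worker.sidecars))
	fmt.Println(master.labels[trialLabel], len(master.sidecars))
}
```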
46 changes: 46 additions & 0 deletions pkg/webhook/v1beta1/pod/inject_webhook_test.go
@@ -1019,3 +1019,49 @@ func TestIsPrimaryPod(t *testing.T) {
 		}
 	}
 }
+
+func TestMutatePodMetadata(t *testing.T) {
+	mutatedPodLabels := map[string]string{
+		"custom-pod-label": "custom-value",
+		"katib-experiment": "katib-value",
+		consts.LabelTrialName: "test-trial",
+	}
+
+	testCases := []struct {
+		pod *v1.Pod
+		trial *trialsv1beta1.Trial
+		mutatedPod *v1.Pod
+		testDescription string
+	}{
+		{
+			pod: &v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: map[string]string{
+						"custom-pod-label": "custom-value",
+					},
+				},
+			},
+			trial: &trialsv1beta1.Trial{
+				ObjectMeta: metav1.ObjectMeta{
+					Name: "test-trial",
+					Labels: map[string]string{
+						"katib-experiment": "katib-value",
+					},
+				},
+			},
+			mutatedPod: &v1.Pod{
+				ObjectMeta: metav1.ObjectMeta{
+					Labels: mutatedPodLabels,
+				},
+			},
+			testDescription: "Mutated Pod should contain label from the origin Pod and Trial",
+		},
+	}
+
+	for _, tc := range testCases {
+		mutatePodMetadata(tc.pod, tc.trial)
+		if !reflect.DeepEqual(tc.mutatedPod, tc.pod) {
+			t.Errorf("Case %v. Expected Pod %v, got %v", tc.testDescription, tc.mutatedPod, tc.pod)
+		}
+	}
+}
21 changes: 21 additions & 0 deletions pkg/webhook/v1beta1/pod/utils.go
@@ -31,6 +31,7 @@ import (

 	common "github.com/kubeflow/katib/pkg/apis/controller/common/v1beta1"
 	trialsv1beta1 "github.com/kubeflow/katib/pkg/apis/controller/trials/v1beta1"
+	"github.com/kubeflow/katib/pkg/controller.v1beta1/consts"
 	mccommon "github.com/kubeflow/katib/pkg/metricscollector/v1beta1/common"
 )

@@ -260,6 +261,26 @@ func mutateMetricsCollectorVolume(pod *v1.Pod, mountPath, sidecarContainerName,
 	return nil
 }

+func mutatePodMetadata(pod *v1.Pod, trial *trialsv1beta1.Trial) {
+	podLabels := map[string]string{}
+
+	// Get labels from the created pod.
+	if pod.Labels != nil {
+		podLabels = pod.Labels
+	}
+
+	// Get labels from Trial.
+	for k, v := range trial.Labels {
+		podLabels[k] = v
+	}
+
+	// Add Trial name label.
+	podLabels[consts.LabelTrialName] = trial.GetName()
+
+	// Append label to the Pod metadata.
+	pod.Labels = podLabels
+}
+
 func getSidecarContainerName(cKind common.CollectorKind) string {
 	if cKind == common.StdOutCollector || cKind == common.FileCollector {
 		return mccommon.MetricLoggerCollectorContainerName
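The merge order in `mutatePodMetadata` means that a Trial label overrides a pod label with the same key, and the Trial name label is always set last. The sketch below reproduces that precedence on plain maps. `mergeLabels` is a hypothetical stand-in, and unlike the function in the diff it copies into a fresh map rather than writing into the pod's existing one.

```go
package main

import "fmt"

const labelTrialName = "katib.kubeflow.org/trial"

// mergeLabels is a hypothetical stand-in for the merge done in mutatePodMetadata.
// It applies pod labels first, then Trial labels, then the Trial name label.
func mergeLabels(podLabels, trialLabels map[string]string, trialName string) map[string]string {
	merged := make(map[string]string, len(podLabels)+len(trialLabels)+1)
	for k, v := range podLabels {
		merged[k] = v
	}
	for k, v := range trialLabels {
		merged[k] = v // a Trial label wins on a key collision
	}
	merged[labelTrialName] = trialName
	return merged
}

func main() {
	// A nil pod label map is handled the same way as an empty one.
	fromNil := mergeLabels(nil, map[string]string{"katib-experiment": "katib-value"}, "test-trial")
	fmt.Println(len(fromNil), fromNil[labelTrialName])

	// On a collision the Trial value replaces the pod value.
	collided := mergeLabels(
		map[string]string{"team": "pod-value"},
		map[string]string{"team": "trial-value"},
		"test-trial",
	)
	fmt.Println(collided["team"])
}
```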
