Rename katib DB manager #1006

Merged
merged 1 commit into from
Jan 9, 2020
4 changes: 2 additions & 2 deletions README.md
@@ -100,8 +100,8 @@ Katib consists of several components as shown below. Each component is running o
Each component communicates with others via GRPC and the API is defined at `pkg/apis/manager/v1alpha3/api.proto`.

- katib: main components.
- katib-manager: GRPC API server of katib which is the DB Interface.
- katib-db: Data storage backend of katib.
- katib-db-manager: GRPC API server of katib which is the DB Interface.
- katib-mysql: Data storage backend of katib using mysql.
- katib-ui: User interface of katib.
- katib-controller: Controller for katib CRDs in Kubernetes.

@@ -1,12 +1,12 @@
FROM golang:alpine AS build-env
# The GOPATH in the image is /go.
ADD . /go/src/github.com/kubeflow/katib
WORKDIR /go/src/github.com/kubeflow/katib/cmd/manager
WORKDIR /go/src/github.com/kubeflow/katib/cmd/db-manager
RUN if [ "$(uname -m)" = "ppc64le" ] || [ "$(uname -m)" = "aarch64" ]; then \
apk --update add git gcc musl-dev && \
go build -o katib-manager ./v1alpha3; \
go build -o katib-db-manager ./v1alpha3; \
else \
go build -o katib-manager ./v1alpha3; \
go build -o katib-db-manager ./v1alpha3; \
fi
RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
if [ "$(uname -m)" = "ppc64le" ]; then \
@@ -19,6 +19,6 @@ RUN GRPC_HEALTH_PROBE_VERSION=v0.3.1 && \
FROM alpine:3.7
WORKDIR /app
COPY --from=build-env /bin/grpc_health_probe /bin/
COPY --from=build-env /go/src/github.com/kubeflow/katib/cmd/manager/katib-manager /app/
ENTRYPOINT ["./katib-manager"]
COPY --from=build-env /go/src/github.com/kubeflow/katib/cmd/db-manager/katib-db-manager /app/
ENTRYPOINT ["./katib-db-manager"]
CMD ["-w", "kubernetes"]
@@ -60,7 +60,7 @@ func (s *server) Check(ctx context.Context, in *health_pb.HealthCheckRequest) (*
return &resp, fmt.Errorf("grpc.health.v1.Health can only be accepted if you specify service name.")
}

// Check if connection to katib-db is okay since otherwise manager could not serve most of its methods.
// Check if connection to katib db driver is okay since otherwise manager could not serve most of its methods.
err := dbIf.SelectOne()
if err != nil {
resp.Status = health_pb.HealthCheckResponse_NOT_SERVING
File renamed without changes.
@@ -15,7 +15,7 @@ def parse_options():
add_help=True
)
parser.add_argument("-s", "--manager_server_addr",
type=str, default="katib-manager:6789")
type=str, default="katib-db-manager:6789")
parser.add_argument("-t", "--trial_name", type=str, default="")
parser.add_argument("-path", "--dir_path", type=str, default="/log")
parser.add_argument("-m", "--metric_names", type=str, default="")
6 changes: 3 additions & 3 deletions docs/proposals/metrics-collector.md
@@ -30,7 +30,7 @@ However, the pulled-based design has [some problems](https://github.com/kubeflow

To enhance the extensibility and support EarlyStopping, we propose a new design of the metrics collector.
In the new design, katib use mutating webhook to inject metrics collector container as a sidecar into Job/Tfjob/PytorchJob pod.
The sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-manager and metadata server).
The sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-db-manager and metadata server).

<center>
<img src="../images/metrics-collector-design.png" width="80%">
@@ -41,7 +41,7 @@ Fig. 1 Architecture of the new design
## Goal

1. **A mutating webhook**: inject metrics collector as a sidecar into master pod.
2. **A metric collector**: collect metrics and store them on the persistent layer (katib-manager).
2. **A metric collector**: collect metrics and store them on the persistent layer (katib-db-manager).
3. **The final metrics** of worker pods should be collected by trail controller and then be stored into trial status.

## API
@@ -134,7 +134,7 @@ In **Job Level Injecting**,
2. For PytorchJob, the metrics collector sidecar is injected into master template.
3. For TfJob, the metrics collector sidecar is injected into master template if master exists. Otherwise, the sidecar is injected into worker template with 0 index.

After injecting, the sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-manager and metadata server).
After injecting, the sidecar collects metrics of the master and then store them on the persistent layer (e.x. katib-db-manager and metadata server).

### Metric Collector

4 changes: 2 additions & 2 deletions docs/proposals/suggestion.md
@@ -27,9 +27,9 @@ Created by [gh-md-toc](https://github.com/ekalinin/github-markdown-toc)

## Background

Katib makes suggestions long-running in v1alpha3 and v1alpha3. And the suggestions need to communicate with katib manager to get experiments and trials from katib-db. This design hurts high availability.
Katib makes suggestions long-running in v1alpha3. And the suggestions need to communicate with katib DB manager to get experiments and trials from katib db driver. This design hurts high availability.

Thus we proposed a new design to implement a CRD for suggestion and remove katib-db from main workflow. The new design simplifies the implmentation of experiment and trial controller, and makes katib Kubernetes native.
Thus we proposed a new design to implement a CRD for suggestion and remove katib db communication from main workflow. The new design simplifies the implmentation of experiment and trial controller, and makes katib Kubernetes native.

This document is to illustrate the details of the new design.

2 changes: 1 addition & 1 deletion examples/v1alpha3/README.md
@@ -238,7 +238,7 @@ spec:
- "-n"
- "kubeflow"
- "-m"
- "katib-manager.kubeflow:6789"
- "katib-db-manager.kubeflow:6789"
- "-mn"
- "Validation-accuracy;accuracy"
restartPolicy: Never
2 changes: 1 addition & 1 deletion examples/v1alpha3/custom-metricscollector-example.yaml
@@ -22,7 +22,7 @@ spec:
- -m
- accuracy
- -s
- katib-manager.kubeflow:6789
- katib-db-manager.kubeflow:6789
- -path
- /katib/mnist.log
image: kubeflowkatib/custom-metrics-collector:latest
@@ -1,38 +1,38 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: katib-manager
name: katib-db-manager
namespace: kubeflow
labels:
app: katib
component: manager
component: db-manager
spec:
replicas: 1
selector:
matchLabels:
app: katib
component: manager
component: db-manager
template:
metadata:
name: katib-manager
name: katib-db-manager
labels:
app: katib
component: manager
component: db-manager
spec:
containers:
- name: katib-manager
image: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-manager
- name: katib-db-manager
image: gcr.io/kubeflow-images-public/katib/v1alpha3/katib-db-manager
imagePullPolicy: IfNotPresent
env:
- name : DB_NAME
value: "mysql"
- name: DB_PASSWORD
valueFrom:
secretKeyRef:
name: katib-db-secrets
name: katib-mysql-secrets
key: MYSQL_ROOT_PASSWORD
command:
- './katib-manager'
- './katib-db-manager'
ports:
- name: api
containerPort: 6789
@@ -1,11 +1,11 @@
apiVersion: v1
kind: Service
metadata:
name: katib-manager
name: katib-db-manager
namespace: kubeflow
labels:
app: katib
component: manager
component: db-manager
spec:
type: ClusterIP
ports:
@@ -14,4 +14,4 @@ spec:
name: api
selector:
app: katib
component: manager
component: db-manager
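Review note: the Deployment's pod template labels and the Service's `selector` both move from `component: manager` to `component: db-manager`, and they have to move together. A Service selects pods whose labels contain every key/value pair in its selector, so a selector left at the old value would match no pods. The following is a minimal sketch of that equality-based matching rule, not Kubernetes' own implementation; `selectorMatches` is a hypothetical helper name.

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in selector
// is present, with the same value, in labels.
// Hypothetical helper illustrating equality-based label selection.
func selectorMatches(selector, labels map[string]string) bool {
	for key, want := range selector {
		if got, ok := labels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	// Pod labels after the rename in this PR.
	podLabels := map[string]string{"app": "katib", "component": "db-manager"}

	staleSelector := map[string]string{"app": "katib", "component": "manager"}
	renamedSelector := map[string]string{"app": "katib", "component": "db-manager"}

	fmt.Println("stale selector matches:", selectorMatches(staleSelector, podLabels))
	fmt.Println("renamed selector matches:", selectorMatches(renamedSelector, podLabels))
}
```

The same reasoning applies to the `component: db` to `component: mysql` rename in the MySQL Deployment and Service.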
@@ -1,26 +1,26 @@
apiVersion: apps/v1
kind: Deployment
metadata:
name: katib-db
name: katib-mysql
namespace: kubeflow
labels:
app: katib
component: db
component: mysql
spec:
replicas: 1
selector:
matchLabels:
app: katib
component: db
component: mysql
template:
metadata:
name: katib-db
name: katib-mysql
labels:
app: katib
component: db
component: mysql
spec:
containers:
- name: katib-db
- name: katib-mysql
image: mysql:8
args:
- --datadir
@@ -29,7 +29,7 @@ spec:
- name: MYSQL_ROOT_PASSWORD
valueFrom:
secretKeyRef:
name: katib-db-secrets
name: katib-mysql-secrets
key: MYSQL_ROOT_PASSWORD
- name: MYSQL_ALLOW_EMPTY_PASSWORD
value: "true"
@@ -2,7 +2,7 @@ apiVersion: v1
kind: Secret
type: Opaque
metadata:
name: katib-db-secrets
name: katib-mysql-secrets
namespace: kubeflow
data:
MYSQL_ROOT_PASSWORD: dGVzdA== # "test"
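Review note: values under a Secret's `data` field are base64-encoded, which is why the manifest carries `dGVzdA==` with a comment giving the plain text. A minimal sketch for checking such a value locally, using only the Go standard library; `decodeSecretValue` is a hypothetical helper name, not part of Katib.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeSecretValue decodes one base64-encoded value as it appears
// under the `data` field of a Secret manifest.
// Hypothetical helper for illustration only.
func decodeSecretValue(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

func main() {
	plain, err := decodeSecretValue("dGVzdA==")
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(plain) // prints "test"
}
```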
@@ -1,11 +1,11 @@
apiVersion: v1
kind: Service
metadata:
name: katib-db
name: katib-mysql
namespace: kubeflow
labels:
app: katib
component: db
component: mysql
spec:
type: ClusterIP
ports:
@@ -14,4 +14,4 @@ spec:
name: dbapi
selector:
app: katib
component: db
component: mysql
51 changes: 25 additions & 26 deletions pkg/common/v1alpha3/katib_manager_util.go
@@ -26,69 +26,68 @@ import (
)

const (
KatibManagerServiceIPEnvName = "KATIB_MANAGER_PORT_6789_TCP_ADDR"
KatibManagerServicePortEnvName = "KATIB_MANAGER_PORT_6789_TCP_PORT"
KatibManagerServiceNamespaceEnvName = "KATIB_MANAGER_NAMESPACE"
KatibManagerService = "katib-manager"
KatibManagerPort = "6789"
ManagerAddr = KatibManagerService + ":" + KatibManagerPort
KatibDBManagerServiceIPEnvName = "KATIB_DB_MANAGER_PORT_6789_TCP_ADDR"
KatibDBManagerServicePortEnvName = "KATIB_DB_MANAGER_PORT_6789_TCP_PORT"
KatibDBManagerService = "katib-db-manager"
KatibDBManagerPort = "6789"
KatibDBManagerAddr = KatibDBManagerService + ":" + KatibDBManagerPort
)

type katibManagerClientAndConn struct {
Conn *grpc.ClientConn
KatibManagerClient api_pb.ManagerClient
type katibDBManagerClientAndConn struct {
Conn *grpc.ClientConn
KatibDBManagerClient api_pb.ManagerClient
}

func GetManagerAddr() string {
func GetDBManagerAddr() string {
ns := consts.DefaultKatibNamespace
if len(ns) == 0 {
addr := os.Getenv(KatibManagerServiceIPEnvName)
port := os.Getenv(KatibManagerServicePortEnvName)
addr := os.Getenv(KatibDBManagerServiceIPEnvName)
port := os.Getenv(KatibDBManagerServicePortEnvName)
if len(addr) > 0 && len(port) > 0 {
return addr + ":" + port
} else {
return ManagerAddr
return KatibDBManagerAddr
}
} else {
return KatibManagerService + "." + ns + ":" + KatibManagerPort
return KatibDBManagerService + "." + ns + ":" + KatibDBManagerPort
}
}

func getKatibManagerClientAndConn() (*katibManagerClientAndConn, error) {
addr := GetManagerAddr()
func getKatibDBManagerClientAndConn() (*katibDBManagerClientAndConn, error) {
addr := GetDBManagerAddr()
conn, err := grpc.Dial(addr, grpc.WithInsecure())
if err != nil {
return nil, err
}
kcc := &katibManagerClientAndConn{
Conn: conn,
KatibManagerClient: api_pb.NewManagerClient(conn),
kcc := &katibDBManagerClientAndConn{
Conn: conn,
KatibDBManagerClient: api_pb.NewManagerClient(conn),
}
return kcc, nil
}

func closeKatibManagerConnection(kcc *katibManagerClientAndConn) {
func closeKatibDBManagerConnection(kcc *katibDBManagerClientAndConn) {
kcc.Conn.Close()
}

func GetObservationLog(request *api_pb.GetObservationLogRequest) (*api_pb.GetObservationLogReply, error) {
ctx := context.Background()
kcc, err := getKatibManagerClientAndConn()
kcc, err := getKatibDBManagerClientAndConn()
if err != nil {
return nil, err
}
defer closeKatibManagerConnection(kcc)
kc := kcc.KatibManagerClient
defer closeKatibDBManagerConnection(kcc)
kc := kcc.KatibDBManagerClient
return kc.GetObservationLog(ctx, request)
}

func DeleteObservationLog(request *api_pb.DeleteObservationLogRequest) (*api_pb.DeleteObservationLogReply, error) {
ctx := context.Background()
kcc, err := getKatibManagerClientAndConn()
kcc, err := getKatibDBManagerClientAndConn()
if err != nil {
return nil, err
}
defer closeKatibManagerConnection(kcc)
kc := kcc.KatibManagerClient
defer closeKatibDBManagerConnection(kcc)
kc := kcc.KatibDBManagerClient
return kc.DeleteObservationLog(ctx, request)
}
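Review note: the environment-variable constants are renamed along with the Service because Kubernetes derives the injected variable names from the Service name (upper-cased, dashes replaced by underscores), so `katib-db-manager` on port 6789 yields `KATIB_DB_MANAGER_PORT_6789_TCP_ADDR` and `KATIB_DB_MANAGER_PORT_6789_TCP_PORT`. The sketch below restates the resolution order of `GetDBManagerAddr` as a standalone function. It is an illustration, not the code in this PR: the namespace and the environment lookup are passed in as parameters so the logic can be exercised without a cluster, and `linkEnvPrefix` and `resolveDBManagerAddr` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	dbManagerService = "katib-db-manager"
	dbManagerPort    = "6789"
)

// linkEnvPrefix converts a Service name into the upper-case,
// underscore-separated form used in injected environment variables.
func linkEnvPrefix(service string) string {
	return strings.ToUpper(strings.ReplaceAll(service, "-", "_"))
}

// resolveDBManagerAddr mirrors the order used in the diff:
//  1. a non-empty namespace gives "<service>.<namespace>:<port>";
//  2. otherwise the injected ADDR/PORT variables are used if both are set;
//  3. otherwise the bare "<service>:<port>" default is returned.
func resolveDBManagerAddr(namespace string, getenv func(string) string) string {
	if namespace != "" {
		return dbManagerService + "." + namespace + ":" + dbManagerPort
	}
	prefix := linkEnvPrefix(dbManagerService) + "_PORT_" + dbManagerPort + "_TCP_"
	addr, port := getenv(prefix+"ADDR"), getenv(prefix+"PORT")
	if addr != "" && port != "" {
		return addr + ":" + port
	}
	return dbManagerService + ":" + dbManagerPort
}

func main() {
	noEnv := func(string) string { return "" }
	fmt.Println(resolveDBManagerAddr("kubeflow", noEnv))
	fmt.Println(resolveDBManagerAddr("", noEnv))

	// Example values only; a real cluster would inject its own address.
	env := map[string]string{
		"KATIB_DB_MANAGER_PORT_6789_TCP_ADDR": "10.0.0.12",
		"KATIB_DB_MANAGER_PORT_6789_TCP_PORT": "6789",
	}
	fmt.Println(resolveDBManagerAddr("", func(k string) string { return env[k] }))
}
```

In the diff itself the namespace comes from `consts.DefaultKatibNamespace`, so the environment-variable branch is only reached when that constant is empty.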
8 changes: 4 additions & 4 deletions pkg/db/v1alpha3/common/const.go
@@ -9,12 +9,12 @@ const (

DBPasswordEnvName = "DB_PASSWORD"

MySQLDBHostEnvName = "KATIB_MYSQL_HOST"
MySQLDBPortEnvName = "KATIB_MYSQL_PORT"
MySQLDatabase = "KATIB_MYSQL_DATABASE"
MySQLDBHostEnvName = "KATIB_MYSQL_DB_HOST"
MySQLDBPortEnvName = "KATIB_MYSQL_DB_PORT"
MySQLDatabase = "KATIB_MYSQL_DB_DATABASE"

DefaultMySQLUser = "root"
DefaultMySQLDatabase = "katib"
DefaultMySQLHost = "katib-db"
DefaultMySQLHost = "katib-mysql"
DefaultMySQLPort = "3306"
)
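Review note: this hunk changes both the environment-variable names (`KATIB_MYSQL_DB_HOST`, `KATIB_MYSQL_DB_PORT`) and the default host (`katib-mysql`). The diff does not show how these constants are consumed, so the following is only a sketch under the assumption that each variable overrides its default when set; `envOrDefault` and `mysqlEndpoint` are hypothetical names.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// envOrDefault returns the value of the named variable, or fallback
// when the variable is unset or empty. Hypothetical helper.
func envOrDefault(getenv func(string) string, name, fallback string) string {
	if v := getenv(name); v != "" {
		return v
	}
	return fallback
}

// mysqlEndpoint builds "host:port" from the renamed variables,
// falling back to the defaults introduced in this hunk.
// Assumption: override-if-set semantics; not confirmed by the diff.
func mysqlEndpoint(getenv func(string) string) string {
	host := envOrDefault(getenv, "KATIB_MYSQL_DB_HOST", "katib-mysql")
	port := envOrDefault(getenv, "KATIB_MYSQL_DB_PORT", "3306")
	return net.JoinHostPort(host, port)
}

func main() {
	// Reads the real process environment; output depends on what is set.
	fmt.Println(mysqlEndpoint(os.Getenv))
}
```

Deployments that set the old `KATIB_MYSQL_HOST` or `KATIB_MYSQL_PORT` variables would need to switch to the new names for an override to take effect under this assumption.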
2 changes: 1 addition & 1 deletion pkg/ui/v1alpha3/backend.go
@@ -29,7 +29,7 @@ func NewKatibUIHandler() *KatibUIHandler {
}

func (k *KatibUIHandler) connectManager() (*grpc.ClientConn, api_pb_v1alpha3.ManagerClient) {
conn, err := grpc.Dial(common_v1alpha3.ManagerAddr, grpc.WithInsecure())
conn, err := grpc.Dial(common_v1alpha3.KatibDBManagerAddr, grpc.WithInsecure())
if err != nil {
log.Printf("Dial to GRPC failed: %v", err)
return nil, nil
2 changes: 1 addition & 1 deletion pkg/webhook/v1alpha3/pod/inject_webhook.go
@@ -249,7 +249,7 @@ func (s *sidecarInjector) getMetricsCollectorContainer(trial *trialsv1alpha3.Tri
}

func getMetricsCollectorArgs(trialName, metricName string, mc common.MetricsCollectorSpec) []string {
args := []string{"-t", trialName, "-m", metricName, "-s", katibmanagerv1alpha3.GetManagerAddr()}
args := []string{"-t", trialName, "-m", metricName, "-s", katibmanagerv1alpha3.GetDBManagerAddr()}
if mountPath, _ := getMountPath(mc); mountPath != "" {
args = append(args, "-path", mountPath)
}
6 changes: 2 additions & 4 deletions prow_config.yaml
@@ -26,8 +26,7 @@ workflows:
- pkg/webhook/v1alpha3/*
- cmd/earlystopping/medianstopping/v1alpha3/*
- cmd/katib-controller/v1alpha3/*
- cmd/manager/v1alpha3/*
- cmd/manager-rest/v1alpha3/*
- cmd/db-manager/v1alpha3/*
- cmd/metricscollector/v1alpha3/*
- cmd/suggestion/bayesianoptimization/v1alpha3/*
- cmd/suggestion/grid/v1alpha3/*
@@ -71,8 +70,7 @@ workflows:
- pkg/webhook/v1alpha3/*
- cmd/earlystopping/medianstopping/v1alpha3/*
- cmd/katib-controller/v1alpha3/*
- cmd/manager/v1alpha3/*
- cmd/manager-rest/v1alpha3/*
- cmd/db-manager/v1alpha3/*
- cmd/metricscollector/v1alpha3/*
- cmd/suggestion/bayesianoptimization/v1alpha3/*
- cmd/suggestion/grid/v1alpha3/*