
Pipeline cannot find some services and directory. #11383

Open
922tech opened this issue Nov 16, 2024 · 2 comments

Comments

@922tech

922tech commented Nov 16, 2024

I installed the Kubeflow Pipelines multi-user setup on a kind cluster, following the README.md in the manifests repo, and it worked fine; I could run pipelines successfully. But my host machine was unexpectedly restarted, and I have had a problem ever since.
(I have listed the pods below.) As you can see, the Kubeflow pods and their related pods (dex, cert-manager, etc.) are running, but my pipelines fail with the error below (the pod named arka-h2o-test-48gfr-system-dag-driver-4203517700 is a workflow created from a pipeline):

Error logs

> kubectl logs arka-h2o-test-48gfr-system-dag-driver-4203517700
time="2024-11-16T13:02:54.832Z" level=info msg="capturing logs" argo=true
I1116 13:02:55.014546      27 main.go:108] input ComponentSpec:{
  "dag": {
    "tasks": {
      "first-component": {
        "cachingOptions": {
          "enableCache": true
        },
        "componentRef": {
          "name": "comp-first-component"
        },
        "taskInfo": {
          "name": "first-component"
        }
      }
    }
  }
}
I1116 13:02:55.015493      27 main.go:121] input ContainerSpec:{}
I1116 13:02:55.015739      27 main.go:128] input RuntimeConfig:{}
I1116 13:02:55.015863      27 main.go:136] input kubernetesConfig:{}
I1116 13:02:55.016946      27 cache.go:139] Cannot detect ml-pipeline in the same namespace, default to ml-pipeline.kubeflow:8887 as KFP endpoint.
I1116 13:02:55.017002      27 cache.go:116] Connecting to cache endpoint ml-pipeline.kubeflow:8887
I1116 13:02:55.097114      27 env.go:65] cannot find launcher configmap: name="kfp-launcher" namespace="new-profile", will use default config
I1116 13:02:55.097208      27 driver.go:153] PipelineRoot="minio://mlpipeline/v2/artifacts" from default config
I1116 13:02:55.097391      27 config.go:176] Cannot detect minio-service in the same namespace, default to minio-service.kubeflow:9000 as MinIO endpoint.
F1116 13:03:01.252896      27 main.go:79] KFP driver: driver.RootDAG(pipelineName=arka-h2o-test, runID=15c917b2-83c3-4f65-b91d-446e5c2f5ff2, runtimeConfig, componentSpec) failed: Failed GetContextByTypeAndName(type="system.Pipeline", name="arka-h2o-test"): rpc error: code = Internal desc = mysql_real_connect failed: errno: , error: 
time="2024-11-16T13:03:01.836Z" level=info msg="sub-process exited" argo=true error="<nil>"
time="2024-11-16T13:03:01.836Z" level=error msg="cannot save parameter /tmp/outputs/execution-id" argo=true error="open /tmp/outputs/execution-id: no such file or directory"
time="2024-11-16T13:03:01.836Z" level=error msg="cannot save parameter /tmp/outputs/iteration-count" argo=true error="open /tmp/outputs/iteration-count: no such file or directory"
time="2024-11-16T13:03:01.836Z" level=error msg="cannot save parameter /tmp/outputs/condition" argo=true error="open /tmp/outputs/condition: no such file or directory"
Error: exit status 1
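The fatal line is the telling one: the driver reaches the MLMD gRPC service, but MLMD cannot connect to MySQL (mysql_real_connect fails with empty errno/error fields). A quick way to narrow this down is to check each link of that chain directly. The commands below assume the default deployment and container names from the Kubeflow manifests (an assumption; adjust to your install):

```shell
# 1. Does the MLMD gRPC server itself report MySQL errors?
kubectl -n kubeflow logs deploy/metadata-grpc-deployment

# 2. Is MySQL up, and what does it log?
kubectl -n kubeflow logs deploy/mysql -c mysql --tail=50

# 3. Can MySQL accept a connection at all? (The default KFP install uses
#    a passwordless root account; this is an assumption about your setup.)
kubectl -n kubeflow exec deploy/mysql -c mysql -- mysql -uroot -e "SELECT 1;"
```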

Pods

NAMESPACE                   NAME                                                     READY   STATUS      RESTARTS        AGE
auth                        dex-678b97fd68-b45hp                                     1/1     Running     6 (4h15m ago)   5d
cert-manager                cert-manager-77fb85564-qt7m7                             1/1     Running     3 (4h15m ago)   5d2h
cert-manager                cert-manager-cainjector-857964b486-2fjwr                 1/1     Running     5 (4h15m ago)   5d2h
cert-manager                cert-manager-webhook-755d476bb8-9znvj                    1/1     Running     5 (4h15m ago)   5d2h
istio-system                istio-ingressgateway-57c8b6474d-bq7bk                    1/1     Running     3 (4h15m ago)   5d1h
istio-system                istiod-844ccb9bc9-s5mt9                                  1/1     Running     3 (4h15m ago)   5d1h
kube-system                 coredns-6f6b679f8f-2tfxl                                 1/1     Running     3 (4h15m ago)   5d2h
kube-system                 coredns-6f6b679f8f-sx6fz                                 1/1     Running     3 (4h15m ago)   5d2h
kube-system                 etcd-kubeflow-control-plane                              1/1     Running     3 (4h15m ago)   5d2h
kube-system                 kindnet-mjfpx                                            1/1     Running     3 (4h15m ago)   5d2h
kube-system                 kube-apiserver-kubeflow-control-plane                    1/1     Running     3 (4h15m ago)   5d2h
kube-system                 kube-controller-manager-kubeflow-control-plane           1/1     Running     3 (4h15m ago)   5d2h
kube-system                 kube-proxy-94ffb                                         1/1     Running     3 (4h15m ago)   5d2h
kube-system                 kube-scheduler-kubeflow-control-plane                    1/1     Running     3 (4h15m ago)   5d2h
kubeflow-user-example-com   arka-h2o-test-g6xch-system-dag-driver-618842653          0/2     Error       0               4h50m
kubeflow-user-example-com   arka-h2o-test-m256m-system-dag-driver-254026790          0/2     Error       0               152m
kubeflow-user-example-com   arka-h2o-test-q4r2w-system-dag-driver-3243798223         0/2     Error       0               3h2m
kubeflow-user-example-com   arka-h2o-test-sz46k-system-dag-driver-1488861785         0/2     Error       0               3h34m
kubeflow-user-example-com   ml-pipeline-ui-artifact-6b44b849d7-8zp6l                 2/2     Running     4 (4h15m ago)   3d
kubeflow-user-example-com   ml-pipeline-visualizationserver-5fcb5568f-bjrjj          2/2     Running     4 (4h15m ago)   3d
kubeflow                    admission-webhook-deployment-5644dcc957-5x89x            1/1     Running     0               22m
kubeflow                    cache-server-59dfb6fcfc-g9qbz                            2/2     Running     0               44m
kubeflow                    centraldashboard-74fc94fcf4-fnrn8                        2/2     Running     0               44m
kubeflow                    kubeflow-pipelines-profile-controller-7b7b8f44f7-5bp45   1/1     Running     0               44m
kubeflow                    metacontroller-0                                         1/1     Running     0               44m
kubeflow                    metadata-envoy-deployment-74dbc5bdcc-jt2t8               1/1     Running     0               44m
kubeflow                    metadata-grpc-deployment-8496ffb98b-twgjv                2/2     Running     2 (44m ago)     44m
kubeflow                    metadata-writer-7d7dfc5b8d-rl47n                         2/2     Running     0               44m
kubeflow                    minio-7c77bc59b8-gqzzq                                   2/2     Running     0               44m
kubeflow                    ml-pipeline-6d5578b59b-2vpqk                             2/2     Running     0               44m
kubeflow                    ml-pipeline-persistenceagent-f97777b7f-h7hs7             2/2     Running     0               44m
kubeflow                    ml-pipeline-scheduledworkflow-6bbc87d49-dxv9x            2/2     Running     0               44m
kubeflow                    ml-pipeline-ui-6cf7f5d654-bkswj                          2/2     Running     1 (32m ago)     44m
kubeflow                    ml-pipeline-viewer-crd-8685d84fb6-mcxt5                  2/2     Running     1 (44m ago)     44m
kubeflow                    ml-pipeline-visualizationserver-75b9c88599-hqrft         2/2     Running     0               44m
kubeflow                    mysql-758cd66576-f8w58                                   2/2     Running     0               44m
kubeflow                    profiles-deployment-5f46f7c9bb-vcrdr                     3/3     Running     1 (44m ago)     44m
kubeflow                    pvcviewer-controller-manager-74c69655f6-2gz6v            3/3     Running     0               44m
kubeflow                    volumes-web-app-deployment-5b558895d6-8xjb2              2/2     Running     0               24m
kubeflow                    workflow-controller-784cfd9c97-6czgj                     2/2     Running     1 (44m ago)     44m
local-path-storage          local-path-provisioner-57c5987fd4-vvkck                  1/1     Running     4 (4h15m ago)   5d2h
mahdi                       arka-h2o-test-hmlsq-system-dag-driver-2653918956         0/2     Error       0               3h55m
mahdi                       arka-h2o-test-lqx4n-system-dag-driver-423916610          0/2     Error       0               4h21m
mahdi                       arka-h2o-test-nd6gk-system-dag-driver-3179925371         0/2     Error       0               4h33m
mahdi                       debug                                                    1/2     Error       0               49m
mahdi                       ml-pipeline-ui-artifact-6b44b849d7-ps89s                 2/2     Running     6 (4h15m ago)   4d22h
mahdi                       ml-pipeline-visualizationserver-5fcb5568f-r7kzt          2/2     Running     6 (4h15m ago)   4d22h
new-profile                 arka-h2o-test-48gfr-system-dag-driver-4203517700         0/2     Error       0               17m
new-profile                 arka-h2o-test-69lv6-system-dag-driver-1279188562         0/2     Error       0               23m
new-profile                 arka-h2o-test-gs7v7-system-dag-driver-24178473           0/2     Completed   0               36m
new-profile                 arka-h2o-test-jct48-system-dag-driver-2654507976         0/2     Error       0               29m
new-profile                 arka-h2o-test-xqkgf-system-dag-driver-2109334120         0/2     Error       0               21m
new-profile                 ml-pipeline-ui-artifact-6b44b849d7-hpjkg                 2/2     Running     0               151m
new-profile                 ml-pipeline-visualizationserver-5fcb5568f-n5v49          2/2     Running     0               151m
oauth2-proxy                oauth2-proxy-65fbcb849-gcwmg                             1/1     Running     3 (4h15m ago)   5d1h
oauth2-proxy                oauth2-proxy-65fbcb849-x4tfr                             1/1     Running     3 (4h15m ago)   5d1h
profile-name                arka-h2o-test-nk6nf-system-dag-driver-537565250          0/2     Error       0               4h27m
profile-name                ml-pipeline-ui-artifact-6b44b849d7-5gc2c                 2/2     Running     6 (4h15m ago)   3d5h
profile-name                ml-pipeline-visualizationserver-5fcb5568f-cpm86          2/2     Running     6 (4h15m ago)   4d5h

Services in the profile namespace

demo@demo ~/p/manifests (master)> kubectl get svc -n new-profile
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
ml-pipeline-ui-artifact           ClusterIP   10.96.116.177   <none>        80/TCP     159m
ml-pipeline-visualizationserver   ClusterIP   10.96.236.168   <none>        8888/TCP   159m
  • How did you deploy Kubeflow Pipelines (KFP)?
    Following the manifests repo
  • KFP version:
    Latest

Impacted by this bug? Give it a 👍.

@rimolive
Member

This error usually indicates an MLMD issue. Please provide the logs from the metadata-grpc-deployment-8496ffb98b-twgjv pod.
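Those logs can be pulled with something like the following (the pod name is taken from the listing above; targeting the deployment instead survives pod churn):

```shell
kubectl -n kubeflow logs metadata-grpc-deployment-8496ffb98b-twgjv
# or, if the pod has been recreated since:
kubectl -n kubeflow logs deploy/metadata-grpc-deployment
```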

@922tech
Author

922tech commented Nov 20, 2024

It has been a while. I actually reinstalled Kubeflow (after deleting it first, of course!).
It worked at first, but after a restart I faced the same issue again:

NAME                                                    READY   STATUS              RESTARTS          AGE
admission-webhook-deployment-7d8c55f5c-76t8k            0/1     ContainerCreating   0                 26h
cache-server-6b644677df-th998                           1/2     Running             67 (8m3s ago)     26h
centraldashboard-668898f49d-nv7zm                       1/2     Running             40 (8m7s ago)     26h
jupyter-web-app-deployment-844bcb7f4b-cgnvp             1/2     Running             40 (7m54s ago)    26h
katib-controller-5474bbcb9b-fr9v4                       0/1     ContainerCreating   0                 26h
katib-db-manager-5774c6949-sfgzw                        0/1     CrashLoopBackOff    80 (14s ago)      26h
katib-mysql-77b9495867-tzcsv                            1/1     Running             1 (6h31m ago)     26h
katib-ui-58cd497cd5-fdllv                               1/2     Running             40 (8m2s ago)     26h
kserve-models-web-app-6ff4bdbf7d-xdh7s                  1/2     Running             40 (7m56s ago)    26h
kubeflow-pipelines-profile-controller-6476b6cb9-tlmzc   1/1     Running             1 (6h31m ago)     26h
metacontroller-0                                        0/1     CrashLoopBackOff    74 (54s ago)      26h
metadata-envoy-deployment-5bc66b9897-rktpv              1/1     Running             1 (6h31m ago)     26h
metadata-grpc-deployment-c568bd446-krltx                0/2     CrashLoopBackOff    164 (113s ago)    26h
metadata-writer-747d764c6d-m5hzq                        0/2     CrashLoopBackOff    101 (91s ago)     26h
minio-55464b6ddb-hzqm4                                  1/2     Running             40 (8m6s ago)     26h
ml-pipeline-7cb687b7bf-m2ggd                            0/2     CrashLoopBackOff    120 (103s ago)    26h
ml-pipeline-persistenceagent-5f45665cb7-fxt7z           0/2     CrashLoopBackOff    97 (107s ago)     26h
ml-pipeline-scheduledworkflow-6ff678946f-fxb9b          1/2     Running             40 (7m55s ago)    26h
ml-pipeline-ui-6457c76ccb-7xqnv                         1/2     Running             40 (8m5s ago)     26h
ml-pipeline-viewer-crd-748b47f958-p4z5b                 0/2     CrashLoopBackOff    110 (71s ago)     26h
ml-pipeline-visualizationserver-547d67f88c-m5rsp        1/2     Running             40 (8m2s ago)     26h
mysql-7d8b8ff4f4-fgz7d                                  1/2     Running             40 (8m7s ago)     26h
notebook-controller-deployment-656fbbc8fd-r2v2d         0/2     CrashLoopBackOff    120 (2m17s ago)   26h
profiles-deployment-5cdb548b74-nhsdt                    0/3     CrashLoopBackOff    210 (56s ago)     26h
tensorboard-controller-deployment-6756c7f668-zbbj2      1/3     CrashLoopBackOff    123 (2m29s ago)   26h
tensorboards-web-app-deployment-5757cfcf8d-dzrvr        1/2     Running             40 (8m11s ago)    26h
training-operator-665b576ff6-rldf2                      0/1     CrashLoopBackOff    74 (2m14s ago)    26h
volumes-web-app-deployment-59ddff4f84-xdgtm             1/2     Running             40 (7m59s ago)    26h
workflow-controller-859c5ff4d8-rtnw2                    0/2     CrashLoopBackOff    121 (70s ago)     26h

metadata-grpc-deployment-c568bd446-krltx

╰─>$ kubectl logs metadata-grpc-deployment-c568bd446-krltx -n kubeflow
WARNING: Logging before InitGoogleLogging() is written to STDERR
E1120 13:18:52.573863     1 mysql_metadata_source.cc:174] MySQL database was not initialized. Please ensure your MySQL server is running. Also, this error might be caused by starting from MySQL 8.0, mysql_native_password used by MLMD is not supported as a default for authentication plugin. Please follow <https://dev.mysql.com/blog-archive/upgrading-to-mysql-8-0-default-authentication-plugin-considerations/>to fix this issue.
F1120 13:18:52.574599     1 metadata_store_server_main.cc:555] Check failed: absl::OkStatus() == status (OK vs. INTERNAL: mysql_real_connect failed: errno: , error:  [mysql-error-info='']) MetadataStore cannot be created with the given connection config.
*** Check failure stack trace: ***
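The MLMD message itself names one known cause: MySQL 8.0 dropped mysql_native_password as the default authentication plugin, which MLMD still relies on. If that turns out to be the cause here (an assumption; the empty errno also matches a plain refused connection), re-enabling the plugin server-side is a small my.cnf change:

```ini
# my.cnf fragment (hypothetical; only relevant if the server runs MySQL 8.0+)
[mysqld]
default_authentication_plugin=mysql_native_password
```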

metadata-writer-747d764c6d-m5hzq

080: Failed to connect to remote host: Connection refused"
Failed to access the Metadata store. Exception: "failed to connect to all addresses; last error: UNKNOWN: ipv4:10.96.142.82:8080: Failed to connect to remote host: Connection refused"
Traceback (most recent call last):
  File "/kfp/metadata_writer/metadata_writer.py", line 68, in <module>
    mlmd_store = connect_to_mlmd()
  File "/kfp/metadata_writer/metadata_helpers.py", line 62, in connect_to_mlmd
    raise RuntimeError('Could not connect to the Metadata store.')
RuntimeError: Could not connect to the Metadata store.

profiles-deployment-5cdb548b74-nhsdt

ctor.go:93: Failed to list *v1.RoleBinding: Get "https://10.96.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: connection refused
E1120 13:21:50.366584       1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190528110200-4f3abb12cae2/tools/cache/reflector.go:93: Failed to list *v1.RoleBinding: Get "https://10.96.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: connection refused
E1120 13:21:51.368604       1 reflector.go:125] pkg/mod/k8s.io/client-go@v0.0.0-20190528110200-4f3abb12cae2/tools/cache/reflector.go:93: Failed to list *v1.RoleBinding: Get "https://10.96.0.1:443/apis/rbac.authorization.k8s.io/v1/rolebindings?limit=500&resourceVersion=0": dial tcp 10.96.0.1:443: connect: connection refused

workflow-controller-859c5ff4d8-rtnw2

time="2024-11-20T13:18:12Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2024-11-20T13:18:12Z" level=info msg="cron config" cronSyncPeriod=10s
time="2024-11-20T13:18:12Z" level=info msg="Memoization caches will be garbage-collected if they have not been hit after" gcAfterNotHitDuration=30s
time="2024-11-20T13:18:12.214Z" level=info msg="not enabling pprof debug endpoints"
time="2024-11-20T13:18:12.215Z" level=fatal msg="Failed to register watch for controller config map: Get \"https://10.96.0.1:443/api/v1/namespaces/kubeflow/configmaps/workflow-controller-configmap\": dial tcp 10.96.0.1:443: connect: connection refused"

dex-c9d5654fb-6dgld (from auth namespace)

time="2024-11-20T13:21:38Z" level=info msg="Dex Version: v2.39.1, Go Version: go1.22.2, Go OS/ARCH: linux amd64"
time="2024-11-20T13:21:38Z" level=info msg="config using log level: debug"
time="2024-11-20T13:21:38Z" level=info msg="config issuer: http://dex.auth.svc.cluster.local:5556/dex"
time="2024-11-20T13:21:38Z" level=info msg="kubernetes client apiVersion = dex.coreos.com/v1"
failed to initialize storage: cannot get kubernetes version: Get "https://[10.96.0.1]:443/version": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

cert-manager-cainjector-7cdfb576c5-mvk8k (from cert-manager namespace)

      --vmodule pattern=N,...                               comma-separated list of pattern=N settings for file-filtered logging (only works for text log format)

E1120 13:24:20.193874       1 main.go:40] "error executing command" err="failed to get API group resources: unable to retrieve the complete list of server APIs: apiextensions.k8s.io/v1: Get \"https://10.96.0.1:443/apis/apiextensions.k8s.io/v1\": dial tcp 10.96.0.1:443: i/o timeout" logger="cert-manager"

All the pods seem to share the same underlying issue: they either cannot resolve some endpoint or cannot retrieve anything from it.
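Note that this second round of failures looks different from the first: every component is failing against 10.96.0.1:443, which is the in-cluster kubernetes apiserver Service, not a Kubeflow endpoint. On a kind cluster this pattern commonly appears after a host reboot, when workload pods come back before kube-proxy and the CNI have restored Service routing. A first check might look like this (the label selectors and the curlimages/curl image are assumptions about a default kind setup):

```shell
# Are kube-proxy and the kindnet CNI pods healthy after the reboot?
kubectl -n kube-system get pods -l k8s-app=kube-proxy
kubectl -n kube-system get pods -l app=kindnet

# Probe the apiserver ClusterIP from inside the cluster
kubectl run net-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sk https://10.96.0.1/version
```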
