
ambassador crashing on node with wrong DNS resolver address due to misconfigured kubelet #1289

Closed
372046933 opened this issue Jul 31, 2018 · 4 comments

After running the following deployment script:
curl https://raw.githubusercontent.com/kubeflow/kubeflow/v0.2.2/scripts/deploy.sh | bash
Ambassador failed to start on one node.

 kubectl logs --namespace kubeflow ambassador-849fb9c8c5-kgrkb ambassador
./entrypoint.sh: set: line 65: can't access tty; job control turned off
2018-07-31 05:46:50 kubewatch 0.30.1 INFO: generating config with gencount 1 (4 changes)
2018-07-31 05:46:56 kubewatch 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7383625940>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:46:56 kubewatch 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7383625940>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016011.063859}
[2018-07-31 05:46:56.133][10][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-07-31 05:46:56.133][10][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:46:56.150][10][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:46:56.150][10][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
AMBASSADOR: starting diagd
AMBASSADOR: starting Envoy
AMBASSADOR: waiting
PIDS: 11:diagd 12:envoy 13:kubewatch
[2018-07-31 05:46:56.556][14][info][main] source/server/server.cc:184] initializing epoch 0 (hot restart version=9.200.16384.127.options=capacity=16384, num_slots=8209 hash=228984379728933363)
[2018-07-31 05:46:57.574][14][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:46:57.767][14][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:46:57.767][14][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
[2018-07-31 05:46:57.769][14][info][main] source/server/server.cc:359] starting main dispatch loop
2018-07-31 05:47:04 diagd 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0bee6d95f8>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:47:04 diagd 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f0bee6d95f8>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016019.808133}
2018-07-31 05:47:14 kubewatch 0.30.1 INFO: generating config with gencount 2 (4 changes)
2018-07-31 05:47:19 kubewatch 0.30.1 WARNING: Scout: could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6fbb8468d0>: Failed to establish a new connection: [Errno -3] Try again',))
2018-07-31 05:47:19 kubewatch 0.30.1 INFO: Scout reports {"latest_version": "0.30.1", "exception": "could not post report: HTTPSConnectionPool(host='kubernaut.io', port=443): Max retries exceeded with url: /scout (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f6fbb8468d0>: Failed to establish a new connection: [Errno -3] Try again',))", "cached": false, "timestamp": 1533016034.702365}
[2018-07-31 05:47:19.770][26][info][upstream] source/common/upstream/cluster_manager_impl.cc:132] cm init: all clusters initialized
[2018-07-31 05:47:19.771][26][info][config] source/server/configuration_impl.cc:55] loading 1 listener(s)
[2018-07-31 05:47:19.788][26][info][config] source/server/configuration_impl.cc:95] loading tracing configuration
[2018-07-31 05:47:19.788][26][info][config] source/server/configuration_impl.cc:122] loading stats sink configuration
unable to initialize hot restart: previous envoy process is still initializing
starting hot-restarter with target: /application/start-envoy.sh
forking and execing new child process at epoch 0
forked new child process with PID=14
got SIGHUP
forking and execing new child process at epoch 1
forked new child process with PID=27
got SIGCHLD
PID=27 exited with code=1
Due to abnormal exit, force killing all child processes and exiting
force killing PID=14
exiting due to lack of child processes
AMBASSADOR: envoy exited with status 1
Here's the envoy.json we were trying to run with:
{
  "listeners": [

    {
      "address": "tcp://0.0.0.0:80",

      "filters": [
        {
          "type": "read",
          "name": "http_connection_manager",
          "config": {"codec_type": "auto",
            "stat_prefix": "ingress_http",
            "access_log": [
              {
                "format": "ACCESS [%START_TIME%] \"%REQ(:METHOD)% %REQ(X-ENVOY-ORIGINAL-PATH?:PATH)% %PROTOCOL%\" %RESPONSE_CODE% %RESPONSE_FLAGS% %BYTES_RECEIVED% %BYTES_SENT% %DURATION% %RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)% \"%REQ(X-FORWARDED-FOR)%\" \"%REQ(USER-AGENT)%\" \"%REQ(X-REQUEST-ID)%\" \"%REQ(:AUTHORITY)%\" \"%UPSTREAM_HOST%\"\n",
                "path": "/dev/fd/1"
              }
            ],
            "route_config": {
              "virtual_hosts": [
                {
                  "name": "backend",
                  "domains": ["*"],"routes": [

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_ready","prefix_rewrite": "/ambassador/v0/check_ready",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/check_alive","prefix_rewrite": "/ambassador/v0/check_alive",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/ambassador/v0/","prefix_rewrite": "/ambassador/v0/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_127_0_0_1_8877", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/tfjobs/","prefix_rewrite": "/tfjobs/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_job_dashboard_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/k8s/ui/","prefix_rewrite": "/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_kubernetes_dashboard_kube_system_otls", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 300000,"prefix": "/user/","prefix_rewrite": "/user/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_hub_lb_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 300000,"prefix": "/hub/","prefix_rewrite": "/hub/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_tf_hub_lb_default", "weight": 100.0 }

                          ]
                      }

                    }
                    ,

                    {
                      "timeout_ms": 3000,"prefix": "/","prefix_rewrite": "/",
                      "weighted_clusters": {
                          "clusters": [

                                 { "name": "cluster_centraldashboard_default", "weight": 100.0 }

                          ]
                      }

                    }


                  ]
                }
              ]
            },
            "filters": [
              {
                "name": "cors",
                "config": {}
              },{"type": "decoder",
                "name": "router",
                "config": {}
              }
            ]
          }
        }
      ]
    }
  ],
  "admin": {
    "address": "tcp://127.0.0.1:8001",
    "access_log_path": "/tmp/admin_access_log"
  },
  "cluster_manager": {
    "clusters": [
      {
        "name": "cluster_127_0_0_1_8877",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://127.0.0.1:8877"
          }

        ]},
      {
        "name": "cluster_centraldashboard_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://centraldashboard.default:80"
          }

        ]},
      {
        "name": "cluster_kubernetes_dashboard_kube_system_otls",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://kubernetes-dashboard.kube-system:443"
          }

        ],
        "ssl_context": {

        }},
      {
        "name": "cluster_tf_hub_lb_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://tf-hub-lb.default:80"
          }

        ]},
      {
        "name": "cluster_tf_job_dashboard_default",
        "connect_timeout_ms": 3000,
        "type": "strict_dns",
        "lb_type": "round_robin",
        "hosts": [
          {
            "url": "tcp://tf-job-dashboard.default:80"
          }

        ]}

    ]
  },
  "statsd_udp_ip_address": "127.0.0.1:8125",
  "stats_flush_interval_ms": 1000
}
AMBASSADOR: shutting down


jlewi commented Jul 31, 2018

Does ambassador start on the other nodes? What happened when it restarted? Did it just crash loop?

Are you running on minikube? Do you have RBAC installed (see #734)?

/cc @kflynn
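
A quick way to check whether the pod is crash looping (and on which node), and whether the RBAC API is enabled — a minimal sketch, assuming the kubeflow namespace used by the deploy script above:

 # RESTARTS column shows crash looping; NODE column shows where the pod landed
 kubectl get pods --namespace kubeflow -o wide
 # If this prints rbac.authorization.k8s.io/v1*, RBAC is available on the cluster
 kubectl api-versions | grep rbac.authorization.k8s.io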


jlewi commented Jul 31, 2018

Is kube-dns running? See #1134.
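
A quick check for that, assuming kube-dns runs in kube-system with the standard k8s-app=kube-dns label:

 kubectl get pods --namespace kube-system -l k8s-app=kube-dns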

372046933 (Author) commented

@jlewi Thanks for your kind reply. I checked the DNS service on every node by running nslookup kubernetes in a busybox pod, and finally found that the node where ambassador crashed had the wrong DNS resolver address. The root cause was the kubelet configuration, which used an erroneous --cluster-dns setting.
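
For reference, a minimal sketch of that check; the pod name dns-test and the node name placeholder are assumptions, and on systemd hosts the --cluster-dns flag may live in a drop-in unit or the kubelet config file rather than on the command line:

 # Run a throwaway busybox pod pinned to the suspect node and resolve a cluster service
 kubectl run dns-test --image=busybox --restart=Never --overrides='{"spec":{"nodeName":"<suspect-node>"}}' -it --rm -- nslookup kubernetes.default
 # On the node itself, confirm which resolver kubelet is handing out to pods
 ps aux | grep kubelet | grep -o -- '--cluster-dns=[^ ]*'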


jlewi commented Aug 2, 2018

Great, glad it's fixed.

@jlewi jlewi closed this as completed Aug 2, 2018
@jlewi jlewi changed the title Ambassador failed to start using the 0.2.2 deploy script ambassador crashing on node with wrong DNS resolver address due to misconfigured kubelet Aug 2, 2018