
Charm stuck in WaitingStatus because of error initializing configuration '/envoy/envoy.yaml' #114

Closed
DnPlas opened this issue Jun 27, 2024 · 6 comments
Labels
bug Something isn't working

Comments

DnPlas (Contributor) commented Jun 27, 2024

Bug Description

It looks like the configuration in /envoy/envoy.yaml is preventing the service from starting correctly, leaving the unit in WaitingStatus without a clear resolution path.

From the logs I can see Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any, which suggests the running Envoy binary does not recognize the v3 HttpConnectionManager type in the generated config.
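
To dig further, one can inspect the rendered config and the Envoy version inside the workload container. A minimal sketch, assuming the pod is envoy-0 in the kubeflow namespace, the workload container is named envoy, and the image ships cat and the envoy binary on PATH:

    # dump the config Envoy is failing to parse
    kubectl -n kubeflow exec envoy-0 -c envoy -- cat /envoy/envoy.yaml
    # check which Envoy version the attached image actually ships
    kubectl -n kubeflow exec envoy-0 -c envoy -- envoy --version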

To Reproduce

  1. Deploy envoy: juju deploy envoy --channel latest/edge --trust
  2. Deploy mlmd: juju deploy mlmd --channel latest/edge --trust
  3. Relate them: juju relate envoy mlmd
  4. Observe

Environment

  1. microk8s 1.29-strict/stable
  2. juju 3.4/stable (3.4.3)

Relevant Log Output

# ---- juju debug-log
unit-envoy-0: 21:22:06 ERROR unit.envoy/0.juju-log grpc:0: execute_components caught unhandled exception when executing configure_charm for envoy-component
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/charm_reconciler.py", line 92, in reconcile
    component_item.component.configure_charm(event)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/component.py", line 50, in configure_charm
    self._configure_unit(event)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 273, in _configure_unit
    self._update_layer()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/charmed_kubeflow_chisme/components/pebble_component.py", line 284, in _update_layer
    container.replan()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/model.py", line 2211, in replan
    self._pebble.replan_services()
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/pebble.py", line 1993, in replan_services
    return self._services_action('replan', [], timeout, delay)
  File "/var/lib/juju/agents/unit-envoy-0/charm/venv/ops/pebble.py", line 2090, in _services_action
    raise ChangeError(change.err, change)
ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "envoy" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-06-27T21:22:06Z INFO Most recent service output:
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:249] initializing epoch 0 (hot restart version=11.104)
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:251] statically linked extensions:
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:253]   access_loggers: envoy.file_access_log,envoy.http_grpc_access_log,envoy.tcp_grpc_access_log
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:256]   filters.http: envoy.buffer,envoy.cors,envoy.csrf,envoy.ext_authz,envoy.fault,envoy.filters.http.adaptive_concurrency,envoy.filters.http.dynamic_forward_proxy,envoy.filters.http.grpc_http1_reverse_bridge,envoy.filters.http.grpc_stats,envoy.filters.http.header_to_metadata,envoy.filters.http.jwt_authn,envoy.filters.http.original_src,envoy.filters.http.rbac,envoy.filters.http.tap,envoy.grpc_http1_bridge,envoy.grpc_json_transcoder,envoy.grpc_web,envoy.gzip,envoy.health_check,envoy.http_dynamo_filter,envoy.ip_tagging,envoy.lua,envoy.rate_limit,envoy.router,envoy.squash
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:259]   filters.listener: envoy.listener.http_inspector,envoy.listener.original_dst,envoy.listener.original_src,envoy.listener.proxy_protocol,envoy.listener.tls_inspector
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:262]   filters.network: envoy.client_ssl_auth,envoy.echo,envoy.ext_authz,envoy.filters.network.dubbo_proxy,envoy.filters.network.mysql_proxy,envoy.filters.network.rbac,envoy.filters.network.sni_cluster,envoy.filters.network.thrift_proxy,envoy.filters.network.zookeeper_proxy,envoy.http_connection_manager,envoy.mongo_proxy,envoy.ratelimit,envoy.redis_proxy,envoy.tcp_proxy
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:264]   stat_sinks: envoy.dog_statsd,envoy.metrics_service,envoy.stat_sinks.hystrix,envoy.statsd
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:266]   tracers: envoy.dynamic.ot,envoy.lightstep,envoy.tracers.datadog,envoy.tracers.opencensus,envoy.tracers.xray,envoy.zipkin
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:269]   transport_sockets.downstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:272]   transport_sockets.upstream: envoy.transport_sockets.alts,envoy.transport_sockets.raw_buffer,envoy.transport_sockets.tap,envoy.transport_sockets.tls,raw_buffer,tls
    [2024-06-27 21:22:06.011][14][info][main] [source/server/server.cc:278] buffer implementation: new
    [2024-06-27 21:22:06.014][14][critical][main] [source/server/server.cc:95] error initializing configuration '/envoy/envoy.yaml': Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any): {"static_resources":{"clusters":[{"load_assignment":{"endpoints":[{"lb_endpoints":[{"endpoint":{"address":{"socket_address":{"port_value":8080,"address":"metadata-grpc-service"}}}}]}],"cluster_name":"metadata-grpc"},"lb_policy":"round_robin","type":"logical_dns","typed_extension_protocol_options":{"envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{"explicit_http_config":{"http2_protocol_options":{}},"@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"}},"name":"metadata-cluster","connect_timeout":"30.0s"}],"listeners":[{"filter_chains":[{"filters":[{"typed_config":{"http_filters":[{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb"},"name":"envoy.filters.http.grpc_web"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"},"name":"envoy.filters.http.cors"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"},"name":"envoy.filters.http.router"}],"route_config":{"virtual_hosts":[{"routes":[{"typed_per_filter_config":{"envoy.filter.http.cors":{"max_age":"1728000","allow_headers":"keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout","allow_methods":"GET, PUT, DELETE, POST, OPTIONS","@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.CorsPolicy","expose_headers":"custom-header-1,grpc-status,grpc-message","allow_origin_string_match":[{"safe_regex":{"regex":".*"}}]}},"match":{"prefix":"/"},"route":{"max_stream_duration":{"grpc_timeout_header_max":"0s"},"cluster":"metadata-cluster"}}],"name":"local_service","domains":["*"]}],"name":"local_route"},"stat_prefix":"ingress_http","@type":"type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager","codec_type":"auto"},"name":"envoy.filters.network.http_connection_manager"}]}],"name":"listener_0","address":{"socket_address":{"port_value":9090,"address":"0.0.0.0"}}}]},"admin":{"address":{"socket_address":{"port_value":9901,"address":"0.0.0.0"}},"access_log":{"typed_config":{"path":"/tmp/admin_access.log","@type":"type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"},"name":"admin_access"}}}
    [2024-06-27 21:22:06.014][14][info][main] [source/server/server.cc:594] exiting
    Unable to parse JSON as proto (INVALID_ARGUMENT:(static_resources.listeners[0].filter_chains[0].filters[0].typed_config): invalid value Invalid type URL, unknown type: envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager for type Any): {"static_resources":{"clusters":[{"load_assignment":{"endpoints":[{"lb_endpoints":[{"endpoint":{"address":{"socket_address":{"port_value":8080,"address":"metadata-grpc-service"}}}}]}],"cluster_name":"metadata-grpc"},"lb_policy":"round_robin","type":"logical_dns","typed_extension_protocol_options":{"envoy.extensions.upstreams.http.v3.HttpProtocolOptions":{"explicit_http_config":{"http2_protocol_options":{}},"@type":"type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"}},"name":"metadata-cluster","connect_timeout":"30.0s"}],"listeners":[{"filter_chains":[{"filters":[{"typed_config":{"http_filters":[{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.grpc_web.v3.GrpcWeb"},"name":"envoy.filters.http.grpc_web"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.Cors"},"name":"envoy.filters.http.cors"},{"typed_config":{"@type":"type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"},"name":"envoy.filters.http.router"}],"route_config":{"virtual_hosts":[{"routes":[{"typed_per_filter_config":{"envoy.filter.http.cors":{"max_age":"1728000","allow_headers":"keep-alive,user-agent,cache-control,content-type,content-transfer-encoding,custom-header-1,x-accept-content-transfer-encoding,x-accept-response-streaming,x-user-agent,x-grpc-web,grpc-timeout","allow_methods":"GET, PUT, DELETE, POST, OPTIONS","@type":"type.googleapis.com/envoy.extensions.filters.http.cors.v3.CorsPolicy","expose_headers":"custom-header-1,grpc-status,grpc-message","allow_origin_string_match":[{"safe_regex":{"regex":".*"}}]}},"match":{"prefix":"/"},"route":{"max_stream_duration":{"grpc_timeout_header_max":"0s"},"cluster":"metadata-cluster"}}],"name":"local_service","domains":["*"]}],"name":"local_route"},"stat_prefix":"ingress_http","@type":"type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager","codec_type":"auto"},"name":"envoy.filters.network.http_connection_manager"}]}],"name":"listener_0","address":{"socket_address":{"port_value":9090,"address":"0.0.0.0"}}}]},"admin":{"address":{"socket_address":{"port_value":9901,"address":"0.0.0.0"}},"access_log":{"typed_config":{"path":"/tmp/admin_access.log","@type":"type.googleapis.com/envoy.extensions.access_loggers.file.v3.FileAccessLog"},"name":"admin_access"}}}
2024-06-27T21:22:06Z ERROR cannot start service: exited quickly with code 1
-----

# ---- juju status

Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-343    microk8s/localhost  3.4.3    unsupported  21:30:29Z

App    Version  Status   Scale  Charm  Channel      Rev  Address         Exposed  Message
envoy           waiting      1  envoy  latest/edge  230  10.152.183.165  no       installing agent
mlmd            active       1  mlmd   latest/edge  197  10.152.183.98   no

Unit      Workload  Agent  Address      Ports  Message
envoy/0*  waiting   idle   10.1.60.154         [envoy-component] Waiting for Pebble services (envoy).  If this persists, it could be a blocking configuration error.
mlmd/0*   active    idle   10.1.60.153

Additional Context

Strangely enough, this is not being caught by envoy's CI - I have run two attempts on HEAD and they both succeeded. This behaviour was caught by the kfp-operators CI here. I was also able to reproduce it locally.

DnPlas added the bug label on Jun 27, 2024

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5907.

This message was autogenerated

DnPlas (Contributor, Author) commented Jun 27, 2024

envoy 2.0/stable

In a model with envoy 2.0/stable this issue is not present:

Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s-343    microk8s/localhost  3.4.3    unsupported  21:51:30Z

App                   Version                Status  Scale  Charm          Channel       Rev  Address         Exposed  Message
envoy                 res:oci-image@cc06b3e  active      1  envoy          2.0/stable    194  10.152.183.154  no
istio-ingressgateway                         active      1  istio-gateway  1.17/stable  1000  10.152.183.112  no
istio-pilot                                  active      1  istio-pilot    1.17/stable  1011  10.152.183.166  no
mlmd                  res:oci-image@44abc5d  active      1  mlmd           1.14/stable   127  10.152.183.167  no

Unit                     Workload  Agent  Address      Ports          Message
envoy/1*                 active    idle   10.1.60.145  9090,9901/TCP
istio-ingressgateway/0*  active    idle   10.1.60.158
istio-pilot/0*           active    idle   10.1.60.156
mlmd/1*                  active    idle   10.1.60.157  8080/TCP

Integration provider     Requirer                          Interface          Type     Message
istio-pilot:ingress      envoy:ingress                     ingress            regular
istio-pilot:istio-pilot  istio-ingressgateway:istio-pilot  k8s-service        regular
istio-pilot:peers        istio-pilot:peers                 istio_pilot_peers  peer
mlmd:grpc                envoy:grpc                        grpc               regular

I noticed that in this version of the charm, we block the unit if the relation with istio-pilot is missing, so I had to deploy istio-operators in order to make the envoy unit go active (roughly as sketched below), but after that the reported issue is not present.
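
For reference, based on the integrations shown in the status above, the extra steps look roughly like this (channels taken from the status output; istio-gateway may need additional config options that are not shown here):

    juju deploy istio-pilot --channel 1.17/stable --trust
    juju deploy istio-gateway istio-ingressgateway --channel 1.17/stable --trust
    juju relate istio-pilot:istio-pilot istio-ingressgateway:istio-pilot
    juju relate envoy:ingress istio-pilot:ingress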

orfeas-k (Contributor) commented

That's weird, because when the envoy.yaml was updated, it had been tested by both me and the PR's reviewer in #102 (review).

orfeas-k (Contributor) commented Jun 28, 2024

OK, so something is wrong with the charm's image. I tried the following and it made the charm go active:

jref envoy --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0

which is the charm's default image.

I confirmed this by deploying the envoy charm with that image, and it went to active:

jd envoy --channel latest/edge --trust --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0
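
(jref and jd above look like local shell aliases; spelled out, the commands would presumably be:)

    juju refresh envoy --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0
    juju deploy envoy --channel latest/edge --trust --resource oci-image=gcr.io/ml-pipeline/metadata-envoy:2.2.0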

Charm publishing

So it looks like the charm's publishing has been messed up.

Publishing from track/2.0

You can see that the charm was published using oci-image revision 104:
https://github.com/canonical/envoy-operator/actions/runs/9662384372/job/26652904155#step:5:180

Publishing from main

You can see that the charm was published again using oci-image revision 104:
https://github.com/canonical/envoy-operator/actions/runs/9701404321/job/26782877651#step:5:184

What happened exactly
  1. Updated envoy in latest/edge using a new image. That created a new resource (oci-image:102) and the charm was published using that new resource.
  2. Something happened that created newer resources. I'm not sure what that was, but we can see that publish jobs from track/2.0 use oci-image:104 as the resource.
  3. Updated the envoy charm in latest/edge (with no change to the image). The publish job also used the latest available resource, meaning oci-image:104.

This results in both charms being published with the same image, although their metadata.yaml files define different ones.
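
As a sketch of how to check this (assuming publisher access to the charm on Charmhub), charmcraft can show which resource revision each channel was released with:

    # channel map, including the resource revision attached to each release
    charmcraft status envoy
    # revision history of the oci-image resource
    charmcraft resource-revisions envoy oci-image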

Conclusion

orfeas-k (Contributor) commented

Charm resource publishing history

The charm has been published with the following resources:

I'm also not sure what 103 is, since the charm image in main didn't change after 10 June.

orfeas-k (Contributor) commented Jul 2, 2024

After transferring this charm to kubeflow-charmers, we re-released envoy with the resource it had been released with when we updated the manifests, executing:

charmcraft release envoy --revision=231 --channel latest/edge --resource=oci-image:102

We'll be looking into the root cause of this as part of canonical/bundle-kubeflow#962.
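
A quick way to confirm the re-release would be to redeploy from latest/edge and check that the unit goes active, mirroring the reproduction steps above:

    juju deploy envoy --channel latest/edge --trust
    juju deploy mlmd --channel latest/edge --trust
    juju relate envoy mlmd
    juju status envoy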

orfeas-k closed this as completed Jul 2, 2024