All traffic sent to mixer as TCP #3099

Closed
Stono opened this issue Feb 2, 2018 · 93 comments

@Stono
Contributor

Stono commented Feb 2, 2018

Hey,
On a fresh install of Istio 0.5.0 on k8s 1.8.6 in GKE with mTLS enabled, I'm seeing errors in Mixer:

2018-02-02T11:56:49.051875Z	info	getting kubeconfig from: ""	{"adapter": "handler.kubernetesenv.istio-system"}
W0202 11:56:49.051913       1 client_config.go:529] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2018-02-02T11:56:49.052169Z	info	getting k8s client from config	{"adapter": "handler.kubernetesenv.istio-system"}
2018-02-02T11:56:49.052923Z	info	Waiting for kubernetes cache sync...	{"adapter": "handler.kubernetesenv.istio-system"}
2018-02-02T11:56:49.153124Z	info	Cache sync successful.	{"adapter": "handler.kubernetesenv.istio-system"}
2018-02-02T11:56:49.153867Z	info	Published snapshot[2] with 4 rules, 3 handlers, previously 0 rules
2018-02-02T11:56:49.153929Z	info	serving prometheus metrics on :42422	{"adapter": "handler.prometheus.istio-system"}
2018-02-02T11:57:22.360156Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-02T11:57:23.364267Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"

and errors in pilot:

2018-02-02T11:56:26.720600Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:50296->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:56:56.721834Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:44674->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:57:26.725496Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:39132->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:57:56.726764Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:33110->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:58:26.728359Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:54060->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:58:56.730797Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:41572->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:59:26.733221Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:50644->10.192.11.232:15003: read: connection reset by peer
2018-02-02T11:59:56.735592Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:51056->10.192.11.232:15003: read: connection reset by peer
2018-02-02T12:00:26.738272Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:51146->10.192.11.232:15003: read: connection reset by peer
2018-02-02T12:00:56.740580Z	info	Unable to retrieve availability zone from pilot: Get http://istio-pilot:15003/v1/az/istio-proxy/pilot~~.~.svc.cluster.local: read tcp 10.194.7.22:51246->10.192.11.232:15003: read: connection reset by peer
@Stono Stono changed the title istio 0.5.0 mixer errors istio 0.5.0 mixer and pilot errors Feb 2, 2018
@Stono Stono closed this as completed Feb 2, 2018
@Stono Stono reopened this Feb 2, 2018
@Stono
Contributor Author

Stono commented Feb 2, 2018

And in Grafana we get nothing, further confirming Mixer isn't working :/

(screenshot: Grafana dashboard from 2018-02-02 15:19 showing no data)

I turned on trace logging in Mixer and see nothing, so the sidecars are either failing to send to it or not sending at all.

@Stono
Contributor Author

Stono commented Feb 2, 2018

A little more randomness: this is one of my app pods, and I noticed this in the istio-proxy logs:

2018-02-02T15:44:13.954603Z     info    Unable to retrieve availability zone from pilot: Get http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.2.19~at-consumer-platform-5788bcf599-5sxrf.default~default.svc.cluster.local: read tcp 10.194.2.19:43234->10.192.15.199:15003: read: connection reset by peer

However, if I go onto the pod, I can curl that fine:

[root@at-sauron-service-search-68d7896cbc-qjjv2 app]# curl http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.2.19~at-consumer-platform-5788bcf599-5sxrf.default~default.svc.cluster.local
europe-west1/europe-west1-d

So this is something to do with the sidecar making that call. It feels like an mTLS issue to me.

@Stono Stono changed the title istio 0.5.0 mixer and pilot errors istio 0.5.0 mixer and pilot errors, no metrics being sent to mixer Feb 2, 2018
@douglas-reid
Contributor

@Stono what does your deployment look like? The logs indicate that Mixer is not able to start serving gRPC, which would prevent it from receiving any Reports(). That suggests something went wrong with the deployment.

@Stono
Contributor Author

Stono commented Feb 2, 2018

@douglas-reid that's a very broad question!

We are running:

  • GKE
  • K8s 1.8.6
  • istio-auth.yaml, installed without any modifications

As this cluster isn't live, I went for a disruptive but fully clean install: I completely removed 0.4.0, the istio-system namespace, and any trace of Istio-related resources I could find. But this is where I end up.

I've tested rolling back to 0.4.0 and that works.

Everything seems to have deployed fine:

❯ kis get pods
NAME                                READY     STATUS    RESTARTS   AGE
grafana-b5775f57d-jjcpb             1/1       Running   0          3h
istio-ca-5796758d78-6hpss           1/1       Running   0          3h
istio-mixer-58984b7c8-mcv5j         3/3       Running   0          3h
istio-pilot-65c8fb777c-msqdg        2/2       Running   0          3h
jaeger-deployment-f8db5dd99-2vlgn   1/1       Running   0          2h
prometheus-65c97b8569-xv54v         1/1       Running   0          2h
❯ kis get services
NAME               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                            AGE
grafana            ClusterIP   10.192.8.121    <none>        3000/TCP                                                           3h
istio-mixer        ClusterIP   10.192.6.175    <none>        9091/TCP,15004/TCP,9093/TCP,9094/TCP,9102/TCP,9125/UDP,42422/TCP   3h
istio-pilot        ClusterIP   10.192.15.25    <none>        15003/TCP,8080/TCP,9093/TCP,443/TCP                                3h
jaeger-agent       ClusterIP   None            <none>        5775/UDP,6831/UDP,6832/UDP                                         3h
jaeger-collector   ClusterIP   10.192.13.179   <none>        14267/TCP,14268/TCP,9411/TCP                                       3h
jaeger-query       ClusterIP   10.192.10.172   <none>        80/TCP                                                             3h
prometheus         ClusterIP   10.192.1.142    <none>        9090/TCP                                                           3h
zipkin             ClusterIP   None            <none>        9411/TCP                                                           3h
❯ kis get configmap
NAME                                    DATA      AGE
istio                                   1         3h
istio-ingress-controller-leader-istio   0         3h
istio-mixer                             1         3h
prometheus                              1         3h

Mixer config at boot:

Mixer started with
MaxMessageSize: 1048576
MaxConcurrentStreams: 1024
APIWorkerPoolSize: 1024
AdapterWorkerPoolSize: 1024
ExpressionEvalCacheSize: 1024
APIPort: 9091
MonitoringPort: 9093
SingleThreaded: false
ConfigStoreURL: k8s://
ConfigDefaultNamespace: istio-system
ConfigIdentityAttribute: destination.service
ConfigIdentityAttributeDomain: svc.cluster.local
LoggingOptions: log.Options{OutputPaths:[]string{"stdout"}, ErrorOutputPaths:[]string{"stderr"}, RotateOutputPath:"", RotationMaxSize:104857600, RotationMaxAge:30, RotationMaxBackups:1000, JSONEncoding:false, IncludeCallerSourceLocation:false, LogGrpc:true, stackTraceLevel:"none", outputLevel:"info"}
TracingOptions: tracing.Options{ZipkinURL:"http://zipkin:9411/api/v1/spans", JaegerURL:"", LogTraceSpans:false}

@Stono
Contributor Author

Stono commented Feb 2, 2018

Happy to provide any more information, logs, etc. if you tell me what you need.

All the ports seem to be listening on mixer:

/ # netstat -plunt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 127.0.0.1:15000         0.0.0.0:*               LISTEN      -
tcp        0      0 0.0.0.0:15004           0.0.0.0:*               LISTEN      -
tcp        0      0 :::9091                 :::*                    LISTEN      -
tcp        0      0 :::9093                 :::*                    LISTEN      -
tcp        0      0 :::9125                 :::*                    LISTEN      1/statsd_exporter
tcp        0      0 :::9102                 :::*                    LISTEN      1/statsd_exporter
tcp        0      0 :::42422                :::*                    LISTEN      -
udp        0      0 :::9125                 :::*                                1/statsd_exporter

@Stono
Contributor Author

Stono commented Feb 2, 2018

@douglas-reid it feels like there's something funky going on with mTLS too.

If I go onto the mixer pod directly:

/ # nc 127.0.0.1 15004 < file
/ #

Absolutely fine

On a pod which doesn't have an envoy sidecar:

/prometheus $ nc istio-mixer.istio-system 15004 < file
/prometheus $

Again, fine.

However on a pod that does have envoy:

[root@at-conference-application-668579f59d-5599c app]# nc istio-mixer.istio-system 15004 < file
Ncat: Connection reset by peer.

And then there's the fact that loads of the Envoy sidecars are doing this:

2018-02-02T18:11:07.819720Z     info    Unable to retrieve availability zone from pilot: Get http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.0.13~at-conference-application-668579f59d-5599c.default~default.svc.cluster.local: read tcp 10.194.0.13:42226->10.192.15.25:15003: read: connection reset by peer
2018-02-02T18:11:37.822188Z     info    Unable to retrieve availability zone from pilot: Get http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.0.13~at-conference-application-668579f59d-5599c.default~default.svc.cluster.local: read tcp 10.194.0.13:42250->10.192.15.25:15003: read: connection reset by peer
2018-02-02T18:12:07.824907Z     info    Unable to retrieve availability zone from pilot: Get http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.0.13~at-conference-application-668579f59d-5599c.default~default.svc.cluster.local: read tcp 10.194.0.13:42266->10.192.15.25:15003: read: connection reset by peer
2018-02-02T18:12:37.828588Z     info    Unable to retrieve availability zone from pilot: Get http://istio-pilot.istio-system:15003/v1/az/at-conference-application/sidecar~10.194.0.13~at-conference-application-668579f59d-5599c.default~default.svc.cluster.local: read tcp 10.194.0.13:42294->10.192.15.25:15003: read: connection reset by peer

Not understanding the architecture well enough, though, I don't know if I'm just chasing shadows.

@Stono
Contributor Author

Stono commented Feb 2, 2018

And the plot thickens...
These are the logs from istio-proxy on the mixer pod:

2018-02-02T17:02:33.160485Z     info    Epoch 1 exited normally
2018-02-02T17:02:33.160614Z     warn    Failed to delete config file /etc/istio/proxy/envoy-rev1.json for 1, remove /etc/istio/proxy/envoy-rev1.json: no such file or directory
2018-02-02T18:01:22.048401Z     info    watchFileEvents: "/etc/certs//..2982_02_02_18_01_22.001916409": CREATE
2018-02-02T18:01:22.048641Z     info    watchFileEvents: "/etc/certs//..2982_02_02_18_01_22.001916409": MODIFY|ATTRIB
2018-02-02T18:01:22.048654Z     info    watchFileEvents: "/etc/certs//..data_tmp": RENAME
2018-02-02T18:01:22.048694Z     info    watchFileEvents: "/etc/certs//..data": CREATE
2018-02-02T18:01:22.048702Z     info    watchFileEvents: "/etc/certs//..2982_02_02_17_02_20.089200026": DELETE
2018-02-02T18:01:32.048748Z     info    watchFileEvents: notifying
2018-02-02T18:01:32.049896Z     info    Received new config, resetting budget
2018-02-02T18:01:32.049932Z     info    Reconciling configuration (budget 10)
2018-02-02T18:01:32.050125Z     info    Epoch 3 starting

@douglas-reid
Contributor

@rkpagadala do you have any idea what may be going wrong with the proxy setup here for Mixer (and mTLS)?

@Stono
Contributor Author

Stono commented Feb 2, 2018

More than happy to send over my Terraform & Istio installation files if needed; just give me an email address.

@douglas-reid
Contributor

Is this possibly related to #3139?

@Stono
Contributor Author

Stono commented Feb 3, 2018

I'm not sure they're related @douglas-reid - I don't have any errors about it failing to read the certs. Here are my logs from istio-proxy:

2018-02-03T13:47:18.794184Z     info    Epoch 17 exited normally
2018-02-03T13:47:18.794395Z     warn    Failed to delete config file /etc/istio/proxy/envoy-rev17.json for 17, remove /etc/istio/proxy/envoy-rev17.json: no such file or directory
2018-02-03T14:45:51.653885Z     info    watchFileEvents: "/etc/certs//..2982_03_02_14_45_51.591460944": CREATE
2018-02-03T14:45:51.653934Z     info    watchFileEvents: "/etc/certs//..2982_03_02_14_45_51.591460944": MODIFY|ATTRIB
2018-02-03T14:45:51.653984Z     info    watchFileEvents: "/etc/certs//..data_tmp": RENAME
2018-02-03T14:45:51.654061Z     info    watchFileEvents: "/etc/certs//..data": CREATE
2018-02-03T14:45:51.654151Z     info    watchFileEvents: "/etc/certs//..2982_03_02_13_47_05.130313966": DELETE
2018-02-03T14:46:01.654060Z     info    watchFileEvents: notifying
2018-02-03T14:46:01.654423Z     info    Received new config, resetting budget
2018-02-03T14:46:01.654487Z     info    Reconciling configuration (budget 10)
2018-02-03T14:46:01.654588Z     info    Epoch 19 starting
2018-02-03T14:46:01.654684Z     info    Envoy command: [-c /etc/istio/proxy/envoy_mixer_auth.json --restart-epoch 19 --drain-time-s 2 --parent-shutdown-time-s 3 --service-cluster istio-proxy --service-node mixer~~.~.svc.cluster.local --max-obj-name-len 189 -l off]
2018-02-03T14:46:04.759410Z     info    Epoch 18 exited normally
2018-02-03T14:46:04.759522Z     warn    Failed to delete config file /etc/istio/proxy/envoy-rev18.json for 18, remove /etc/istio/proxy/envoy-rev18.json: no such file or directory
2018-02-03T15:44:41.687273Z     info    watchFileEvents: "/etc/certs//..2982_03_02_15_44_41.635030018": CREATE
2018-02-03T15:44:41.687322Z     info    watchFileEvents: "/etc/certs//..2982_03_02_15_44_41.635030018": MODIFY|ATTRIB
2018-02-03T15:44:41.687338Z     info    watchFileEvents: "/etc/certs//..data_tmp": RENAME
2018-02-03T15:44:41.687548Z     info    watchFileEvents: "/etc/certs//..data": CREATE
2018-02-03T15:44:41.687662Z     info    watchFileEvents: "/etc/certs//..2982_03_02_14_45_51.591460944": DELETE
2018-02-03T15:44:51.687448Z     info    watchFileEvents: notifying
2018-02-03T15:44:51.687763Z     info    Received new config, resetting budget
2018-02-03T15:44:51.687786Z     info    Reconciling configuration (budget 10)
2018-02-03T15:44:51.687843Z     info    Epoch 20 starting
2018-02-03T15:44:51.687950Z     info    Envoy command: [-c /etc/istio/proxy/envoy_mixer_auth.json --restart-epoch 20 --drain-time-s 2 --parent-shutdown-time-s 3 --service-cluster istio-proxy --service-node mixer~~.~.svc.cluster.local --max-obj-name-len 189 -l off]
2018-02-03T15:44:54.794074Z     info    Epoch 19 exited normally
2018-02-03T15:44:54.794302Z     warn    Failed to delete config file /etc/istio/proxy/envoy-rev19.json for 19, remove /etc/istio/proxy/envoy-rev19.json: no such file or directory

I find it strange, however, that the config file isn't there. So this is on a working pod:

❯ k exec -it at-sauron-web-777f855596-vmwqx -c istio-proxy /bin/bash

I have no name!@at-sauron-web-777f855596-vmwqx:/$ ls -al /etc/istio/proxy/
total 8
drwxrwxrwt 2 root root   60 Feb  3 14:59 .
drwxr-xr-x 3 root root 4096 Jan 30 03:36 ..
-rw-r--r-- 1 1337 root 3209 Feb  3 14:59 envoy-rev19.json

And this is on the mixer's istio-proxy:

❯ kis exec -it istio-mixer-f6f9fc987-q66w5 -c istio-proxy /bin/bash
root@istio-mixer-f6f9fc987-q66w5:/# ls -al /etc/istio/proxy/
total 24
drwxr-xr-x 2 root root 4096 Jan 30 03:36 .
drwxr-xr-x 3 root root 4096 Jan 30 03:36 ..
-rw-r--r-- 2 root root  771 Jan 30 03:28 envoy_mixer.json
-rw-r--r-- 1 root root  988 Jan 30 03:28 envoy_mixer_auth.json
-rw-r--r-- 4 root root  771 Jan 30 03:28 envoy_pilot.json
-rw-r--r-- 3 root root  988 Jan 30 03:28 envoy_pilot_auth.json

@Stono
Contributor Author

Stono commented Feb 3, 2018

Actually, you know, you may be onto something. I'm unable to hit other mTLS services from the mixer pod (it's my understanding that I should be able to).

In this example at-urs-service is a little API with istio-proxy + mTLS, and I'm trying to wget it from the mixer container.

/ # wget at-urs-service.default
Connecting to at-urs-service.default (10.192.3.221:80)
wget: error getting response: Connection reset by peer

And the mixer log is just littered with:

2018-02-03T13:58:51.167118Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:51.168866Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:55.057021Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:55.083331Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:56.058662Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:56.084720Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:56.928225Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:57.062342Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:57.929143Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:58:58.064151Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:59:06.995572Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:59:07.160657Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:59:07.997044Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-02-03T13:59:08.161292Z     info    grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"

@sakshigoel12
Contributor

@douglas-reid do we understand the problem yet?

@douglas-reid
Contributor

@sakshigoel12 I'm hoping that @wattli can take a look and see if anything obvious jumps out here.

@Stono
Contributor Author

Stono commented Feb 5, 2018

Happy to get on calls, screen share, provide logs, click my heels... anything I can do to help debug, just ask.

@linsun
Member

linsun commented Feb 5, 2018

The pilot error should not cause you any harm. BTW, I was able to get metrics in the Istio dashboard with Istio 0.5 on IBM Cloud Container Service.

@linsun
Member

linsun commented Feb 5, 2018

Make sure you use manual injection to inject the sidecar...

@Stono
Contributor Author

Stono commented Feb 5, 2018

@linsun I have injected the sidecar :-) It's not the pilot errors that concern me, to be honest; it's the gRPC errors in Mixer.

Out of curiosity, what is your setup? We're specifically using MUTUAL_TLS on an RBAC-enabled cluster.

@wattli
Contributor

wattli commented Feb 5, 2018

@Stono, if we disable Mixer, is there still an issue?

@douglas-reid
Contributor

@Stono mTLS + RBAC should work out of the box (we have a series of e2e tests specifically for this use case: https://prow.istio.io/?job=e2e-suite-rbac-auth). They should be exercising this functionality fairly heavily.

When you installed 0.5.0, is it possible that any installation steps were missed?

@nmittler
Contributor

nmittler commented Mar 5, 2018

@jeffmendoza any updates on this?

@Stono
Contributor Author

Stono commented Mar 6, 2018

Hey,
Just as an FYI, I upgraded to 0.6.0 in dev this morning and still experience the same behaviour.

@nmittler
Contributor

nmittler commented Mar 6, 2018

This is a pretty old bug ... we need to make some progress here.

It looks like the protocol selection happens here. So the ports appear to be named correctly (starting with http/https and separated by -). The ports in the attached xds yaml seem to have the same names, but apart from that I'm not sure what to look for.

@kyessenov would you mind taking a look? The problem is that all traffic shows up in Mixer as TCP rather than HTTP, even though the ports appear to be named correctly.
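
For anyone following along, the convention the selection code keys off is, roughly, the Kubernetes Service port name: "<protocol>" or "<protocol>-<suffix>". An illustrative snippet (placeholder service and port values, not taken from the attached yaml):

apiVersion: v1
kind: Service
metadata:
  name: example-service        # hypothetical name, for illustration only
spec:
  selector:
    app: example
  ports:
  - name: http-web             # "http" prefix -> treated as HTTP
    port: 80
    targetPort: 8080
  - name: https-secure         # "https" prefix -> treated as opaque TLS, reported as tcp
    port: 443
    targetPort: 8443
  - name: metrics              # unrecognised prefix -> falls back to plain tcp
    port: 9100
    targetPort: 9100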

@day0ops
Contributor

day0ops commented Mar 7, 2018

I too am seeing a similar behaviour with 0.5.1 with mTLS. Our services in the mesh talking to services outside the mesh (such as Cassandra or Kafka) fail with io.grpc.StatusRuntimeException: UNAVAILABLE: io exception

2018-03-06T12:27:46.836697Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:27:47.836875Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:27:51.335759Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:27:52.435371Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:27:56.839246Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:27:58.035478Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:28:01.270762Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:28:02.342267Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:28:06.760703Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"
2018-03-06T12:28:07.835735Z	info	grpc: Server.Serve failed to create ServerTransport: connection error: desc = "transport: http2Server.HandleStreams failed to receive the preface from client: EOF"

@nmittler nmittler changed the title istio 0.5.0 mixer and pilot errors, no metrics being sent to mixer All traffic sent to mixer as TCP Mar 13, 2018
@nmittler
Contributor

@mandarjog would you mind taking a look?

@douglas-reid
Contributor

That last comment looks like traffic being sent to Mixer without using gRPC. What does the filter on the egress look like?

@day0ops
Contributor

day0ops commented Mar 13, 2018

I have not defined any egress rules for this.

@day0ops
Contributor

day0ops commented Mar 13, 2018

I'm not entirely sure the Cassandra traffic is gRPC anyway.

@Stono
Contributor Author

Stono commented Mar 16, 2018

@douglas-reid @nixgadget is that not a different issue from this one? This issue is about http services having their traffic reported as tcp.

The gRPC issue you're experiencing went away for me when I did a "clean" install of Istio (which is great, as I was never able to reproduce it).

@mandarjog
Contributor

The gRPC issue is most likely because of an mTLS mismatch: Mixer was expecting mTLS and Envoy was not sending it.
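
For context, an abbreviated and illustrative sketch (from memory, not copied from this cluster's config): with the istio-auth.yaml install, the mesh-wide auth policy that makes in-mesh servers expect client certs lives in the istio ConfigMap, roughly:

apiVersion: v1
kind: ConfigMap
metadata:
  name: istio
  namespace: istio-system
data:
  mesh: |
    # With MUTUAL_TLS set, servers in the mesh expect TLS client certs; a
    # plaintext client gets its connection reset, which shows up as the
    # "failed to receive the preface" errors above.
    authPolicy: MUTUAL_TLS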

@mandarjog
Contributor

@Stono sorry for the late reply. Just coming back to this.
Can you confirm the following:

  • conference, rate-of-sale and sauron are correctly being reported as http services.
  • ingress-nginx-external is reported as tcp and http (perhaps correct?)
  • consumer, search and ingress-nginx-internal are tcp only, which was the original problem.

@nmittler
Contributor

nmittler commented Apr 2, 2018

@mandarjog We now have an nginx ingress test that might help in your testing. See #4569

@mandarjog
Contributor

Going back through the PR, I can confirm that the configuration generated by pilot is incorrect: it is indeed creating an inbound tcp listener.

The outbound listener is correctly identified as 'http', but since metrics reporting happens at the inbound listener, we see tcp-only metrics for at-consumer-platform-service.

Continuing investigation ...

@Stono
Contributor Author

Stono commented Apr 3, 2018

Just as an FYI on this never-ending thread (lol), I've upgraded to 0.7.1 and this issue is still there. Is the name of the port the only thing which defines whether it is http or tcp?

@mandarjog mandarjog assigned mandarjog and unassigned jeffmendoza Apr 3, 2018
@mandarjog
Contributor

@Stono can you get the latest ./istio-proxy-cfg.py and pull the configuration for at-consumer-platform-service and at-sauron-web-service?
There is some more information that we gather now.

@mandarjog
Contributor

I would also like to see the route rules and destination policies that pertain to at-consumer-platform-service and at-sauron-web-service.

@mandarjog
Contributor

And istio-pilot logs, thanks.

@Stono
Contributor Author

Stono commented Apr 4, 2018

Sent to your email

@mandarjog
Contributor

@Stono can you use the gcr.io/istio-testing/pilot:0.7.1d image for the real pilot?
Then grep for "Listener collision" in the pilot logs...

I think port 8088 is a red herring. The issue is probably a listener collision...
If you get rid of the port below from at-consumer-platform then you should no longer see those messages.
It would be great if you could confirm through the logs.

  - name: https-web-mtls
    port: 443
    targetPort: http-web

@Stono
Contributor Author

Stono commented Apr 6, 2018

Hey,
I don't understand, though: we have 2 services OK and 3 services not; the 2 OK services use 8080, and the others use a different port. When we update one of the other deployments to port 8080, it starts working OK too... surely that amount of correlation isn't a red herring? :o

Anyway, there are collisions in the logs, but as you're aware the real pilot is very noisy. There seem to be collisions for every port on every service.

2018-04-06T07:12:11.681891Z     error   Listener collision for tcp://10.202.0.5:8080
2018-04-06T07:12:11.810463Z     error   Listener collision for tcp://10.202.0.2:8080
2018-04-06T07:12:12.338550Z     error   Listener collision for tcp://10.202.0.4:8088
2018-04-06T07:12:12.366208Z     error   Listener collision for tcp://10.202.1.19:8080
2018-04-06T07:12:12.500750Z     error   Listener collision for tcp://10.202.5.36:8375
2018-04-06T07:12:12.532300Z     error   Listener collision for tcp://10.202.1.14:8080
2018-04-06T07:12:12.650609Z     error   Listener collision for tcp://10.202.5.34:8080
2018-04-06T07:12:12.815743Z     error   Listener collision for tcp://10.202.1.16:8080
2018-04-06T07:12:12.874556Z     error   Listener collision for tcp://10.202.0.11:8080
2018-04-06T07:12:12.909071Z     error   Listener collision for tcp://10.202.1.36:8080
2018-04-06T07:12:12.915496Z     error   Listener collision for tcp://10.202.0.7:8375
2018-04-06T07:12:13.300029Z     error   Listener collision for tcp://10.202.1.15:8303
2018-04-06T07:12:13.307789Z     error   Listener collision for tcp://10.202.1.33:8080
2018-04-06T07:12:13.369762Z     error   Listener collision for tcp://10.202.5.35:8080
2018-04-06T07:12:13.402037Z     error   Listener collision for tcp://10.202.0.5:8080
2018-04-06T07:12:13.434464Z     error   Listener collision for tcp://10.202.0.2:8080
2018-04-06T07:12:13.772201Z     error   Listener collision for tcp://10.202.0.4:8088
2018-04-06T07:12:13.802512Z     error   Listener collision for tcp://10.202.1.19:8080
2018-04-06T07:12:13.973932Z     error   Listener collision for tcp://10.202.1.14:8080
2018-04-06T07:12:14.063323Z     error   Listener collision for tcp://10.202.5.36:8375
2018-04-06T07:12:14.149915Z     error   Listener collision for tcp://10.202.0.11:8080

I've sent you both sets of logs via email to look into yourself.

@Stono
Contributor Author

Stono commented Apr 6, 2018

I decided to try and test your hypothesis... at-search-one-service.default.svc.cluster.local is one of the ones reporting as TCP.

I edited the service and removed the mTLS port, and it was still being reported as tcp after a few minutes of traffic to it. So I'm not sure that the mTLS port on the service has anything to do with it, as completely removing it had no effect.

I then put the mTLS port back and edited the deployment to use 8080 instead of 8375 for the containerPort, and now search-one is correctly appearing as an http service.

I changed the port back to 8375 and it stopped working again.

I am happy to demonstrate this on hangouts for you later!

@Stono
Contributor Author

Stono commented Apr 6, 2018

Oh what fun this has been, but thanks to @mandarjog we have finally got to the bottom of it!

@mandarjog may want to elaborate more, but at a high level all of our services exposed two ports, http-web and https-web, on 80 and 443 respectively. Both of these service ports then forwarded to the same containerPort, named http-web. That containerPort was either 8080, 8088 or 8375.

This multiplexing is not actually supported by Istio and resulted in a "port collision": only one of the two could be used, and the code base basically took the first one from the list. That first one was sometimes http-web (correct, http) and sometimes https-web (incorrect, tcp). We've not worked out why changing the port of some services from 8088 to 8080 changed which service port was picked in the collision, but the working theory is that the change affected the order of the list, thus changing the one at index 0.
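
To make the shape of the problem concrete, here's an illustrative sketch of what our Service definitions looked like (placeholder names and ports, not an exact copy of our manifests):

apiVersion: v1
kind: Service
metadata:
  name: at-example-service     # placeholder name
spec:
  ports:
  - name: http-web             # classified as http
    port: 80
    targetPort: http-web       # containerPort named http-web (8080/8088/8375)
  - name: https-web            # classified as tcp (https is treated as opaque TLS)
    port: 443
    targetPort: http-web       # same containerPort as above -> inbound listener collision

Because both service ports resolve to the same pod port, pilot tries to build two inbound listeners for the same address:port; only one survives the collision, and whether the survivor came from the http-web or the https-web definition determines whether the service shows up in Mixer as http or tcp.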

The https-web port is due to us testing routing traffic into the mesh from an on-premise F5 load balancer using mTLS. Happy to explain why we were trying this out on a call, but long story short we don't need it at the moment, so we can simply remove it, which (I have tested) resolves the problem and all services start getting reported correctly as http!

@mandarjog up to you if you want to close this issue or leave it open and link your PR with improvements to logging to help others debug this issue more easily in the future?

Thanks!
