otel-collector requires HTTP/2 TLS passthrough from Envoy / Contour: should it? #1916

Closed
kevincantu opened this issue Oct 7, 2020 · 12 comments
Labels: bug, priority:p2

kevincantu commented Oct 7, 2020

I've just gotten started setting up otel-collector for some Kubernetes clusters where we use Envoy (configured via Contour) for routing, and discovered a detail that gave me fits, so I think it's worth laying it all out here. I suspect it may be a gRPC server issue in the collector: some gnarly interaction with Envoy, perhaps?

Expected

What I hoped was that otel-collector could be set up much like this demo with YAGES (a gRPC echo server), where:

  • grpcurl sends gRPC TLS traffic,
  • Envoy terminates TLS,
  • Envoy sends traffic to the upstream as HTTP/2 cleartext gRPC.

I set this up using a Contour HTTPProxy in TCP proxying mode, which relies on SNI to route traffic by domain name:

---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  selector:
    matchLabels:
      app: yages
  replicas: 1
  template:
    metadata:
      labels:
        app: yages
    spec:
      containers:
      - name: grpcsrv
        image: quay.io/mhausenblas/yages:0.1.0
        ports:
        - containerPort: 9000
          protocol: TCP
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
---
apiVersion: v1
kind: Service
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  ports:
  - name: demo
    port: 55682
    protocol: TCP
    targetPort: 9000
  selector:
    app: yages
---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: yages
  namespace: monitoring
  labels:
    app: yages
spec:
  virtualhost:
    fqdn: yages.staging.test
    tls:
      secretName: yages-wildcard
      #passthrough: true
  tcpproxy:
    services:
    - name: yages
      port: 55682
      # tls: HTTP/1 TLS
      # h2:  HTTP/2 TLS
      # h2c: HTTP/2 cleartext
      protocol: h2c

You can exercise that yages app (to send a ping and receive a pong) with the following grpcurl command:

grpcurl --insecure -v yages.staging.test:443 yages.Echo.Ping
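
gRPC clients generally expect the TLS endpoint to negotiate `h2` via ALPN, so a quick check of what the endpoint on port 443 picks can help narrow things down. The following is a minimal sketch using only the Python standard library; the helper names `build_alpn_context` and `negotiated_protocol` are made up for illustration, and skipping certificate checks is an assumption that mirrors the `--insecure` flag above.

```python
import socket
import ssl
from typing import Optional


def build_alpn_context(insecure: bool = True) -> ssl.SSLContext:
    """Client TLS context that offers HTTP/2 first, then HTTP/1.1, via ALPN."""
    ctx = ssl.create_default_context()
    if insecure:
        # Assumption: mirror grpcurl's --insecure by skipping certificate
        # and hostname verification. Do not use this outside of debugging.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    ctx.set_alpn_protocols(["h2", "http/1.1"])
    return ctx


def negotiated_protocol(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Return the ALPN protocol the TLS endpoint selected, or None if it selected none."""
    ctx = build_alpn_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        # server_hostname sets SNI, which the tcpproxy route relies on.
        with ctx.wrap_socket(raw, server_hostname=host) as tls:
            return tls.selected_alpn_protocol()
```

Calling `negotiated_protocol("yages.staging.test")` from a machine that can reach the ingress reports whichever protocol was negotiated by the process that actually terminates TLS: Envoy when a `secretName` is set, or the backend itself when `passthrough: true` is used.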

I expected routing just like that to work for otel-collector:

  • my Python demo app using opentelemetry-exporter-otlp sends gRPC TLS traffic,
  • Envoy terminates TLS,
  • Envoy sends traffic to the upstream as HTTP/2 cleartext gRPC...

Actual

But that didn't work.

Instead, when configuring Envoy (via Contour) like that, I saw TCP events in the Envoy access logs like so, but no success:

[2020-10-01T03:18:09.593Z] "- - -" 0 - 0 15 33 - "-" "-" "-" "-" "172.21.5.170:55680"
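
To make that log line easier to read, here is a small sketch that splits it into named fields. It assumes the line follows Envoy's default access log format (start time, request line, response code, response flags, bytes received, bytes sent, duration, upstream service time, then five quoted fields ending with the upstream host); if the access log format was customized, the field names will not line up. The `parse_access_log` helper is hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: Envoy's default access log format is in use.
ACCESS_LOG = re.compile(
    r'\[(?P<start_time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<response_code>\S+) (?P<response_flags>\S+) '
    r'(?P<bytes_received>\d+) (?P<bytes_sent>\d+) (?P<duration_ms>\d+) '
    r'(?P<upstream_service_time>\S+) '
    r'"(?P<x_forwarded_for>[^"]*)" "(?P<user_agent>[^"]*)" '
    r'"(?P<request_id>[^"]*)" "(?P<authority>[^"]*)" "(?P<upstream_host>[^"]*)"'
)


def parse_access_log(line: str) -> Optional[Dict[str, str]]:
    """Return the named fields of one access log line, or None if it does not match."""
    match = ACCESS_LOG.match(line.strip())
    return match.groupdict() if match else None


LINE = '[2020-10-01T03:18:09.593Z] "- - -" 0 - 0 15 33 - "-" "-" "-" "-" "172.21.5.170:55680"'
fields = parse_access_log(LINE)
print(fields["upstream_host"], fields["bytes_received"], fields["bytes_sent"], fields["duration_ms"])
```

Under that assumption, the entry describes a plain TCP connection (no HTTP method, path, or protocol) that was proxied to 172.21.5.170:55680, with 0 bytes received, 15 bytes sent, and a duration of 33 ms.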

My sample app (sending traffic to otel-grpc.staging.test:443) only received StatusCode.UNAVAILABLE error responses! (I extended this part of the opentelemetry-exporter-otlp Python library to log those codes.)

Workaround

To make things work, I had to configure Envoy to pass HTTP/2 TLS traffic to the upstream.

Like so:

  • my Python demo app using opentelemetry-exporter-otlp sends gRPC TLS traffic,
  • Envoy passes TLS traffic through to the upstream as HTTP/2 TLS gRPC.

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector-conf
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
            tls_settings:
              cert_file: /tls/cert.pem
              key_file: /tls/key.pem
          http:
    processors:
      batch:
      memory_limiter:
        # Same as --mem-ballast-size-mib CLI argument
        ballast_size_mib: 1024
        # 80% of maximum memory
        limit_mib: 1600
        # 25% of limit
        spike_limit_mib: 512
        check_interval: 5s
    extensions:
      health_check: {}
      zpages:
        endpoint: "0.0.0.0:55679"  # default was localhost only!
    exporters:
      logging:
        logLevel: debug
      honeycomb:
        api_key: "$HONEYCOMB_API_KEY"
        dataset: "apps"
        api_url: "https://api.honeycomb.io"
    service:
      extensions: [health_check, zpages]
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [logging, honeycomb]
---
apiVersion: v1
kind: Service
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  ports:
  - name: zpages
    port: 55679
    # when proxied: http://localhost:8001/api/v1/namespaces/monitoring/services/http:otel-collector:55679/proxy/debug/tracez
  - name: otlp-grpc # Default endpoint for OpenTelemetry receiver.
    port: 55680
  - name: otlp-http
    port: 55681
  - name: jaeger-grpc # Default endpoint for Jaeger gRPC receiver.
    port: 14250
  - name: jaeger-thrift-http # Default endpoint for Jaeger HTTP receiver.
    port: 14268
  - name: zipkin # Default endpoint for Zipkin receiver.
    port: 9411
  - name: metrics # Default endpoint for querying metrics.
    port: 8888
  selector:
    component: otel-collector
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: monitoring
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  selector:
    matchLabels:
      app: opentelemetry
      component: otel-collector
  minReadySeconds: 5
  progressDeadlineSeconds: 120
  replicas: 2
  template:
    metadata:
      labels:
        app: opentelemetry
        component: otel-collector
    spec:
      containers:
      - command:
          - "/otelcontribcol"
          - "--log-level=DEBUG"
          - "--config=/conf/otel-collector-config.yaml"
          # Memory Ballast size should be max 1/3 to 1/2 of memory.
          - "--mem-ballast-size-mib=1024"
        #image: otel/opentelemetry-collector-dev:latest
        image: otel/opentelemetry-collector-contrib:0.11.0
        name: otel-collector
        envFrom:
        - secretRef:
            name: otel-collector
        resources:
          limits:
            cpu: 1
            memory: 2Gi
          requests:
            cpu: 200m
            memory: 400Mi
        ports:
        - containerPort: 55679 # Default endpoint for ZPages.
        - containerPort: 55680 # OTLP gRPC receiver.
        - containerPort: 55681 # OTLP HTTP/JSON receiver.
        - containerPort: 14250 # Default endpoint for Jaeger gRPC receiver.
        - containerPort: 14268 # Default endpoint for Jaeger HTTP receiver.
        - containerPort: 9411  # Default endpoint for Zipkin receiver.
        - containerPort: 8888  # Default endpoint for querying metrics.
        volumeMounts:
        - name: otel-collector-config-vol
          mountPath: /conf
        - name: otel-tls
          mountPath: /tls
        livenessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
        readinessProbe:
          httpGet:
            path: /
            port: 13133 # Health Check extension default port.
      volumes:
        - name: otel-collector-config-vol
          configMap:
            name: otel-collector-conf
            items:
              - key: otel-collector-config
                path: otel-collector-config.yaml
        - name: otel-tls
          secret:
            secretName: otel-wildcard
            items:
              - key: tls.crt
                path: cert.pem
              - key: tls.key
                path: key.pem
---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: otel-collector
  namespace: monitoring
  annotations:
    kubernetes.io/ingress.class: "contour"
  labels:
    app: opentelemetry
    component: otel-collector
spec:
  virtualhost:
    fqdn: otel.staging.test
    tls:
      #secretName: otel-wildcard
      passthrough: true
  tcpproxy:
    services:
    - name: otel-collector
      port: 55680
      # tls: HTTP/1 TLS
      # h2:  HTTP/2 TLS
      # h2c: HTTP/2 cleartext
      protocol: h2

That is, in addition to the TLS cert setup for otel-collector, the workaround needs this Contour HTTPProxy config change:

   virtualhost:
...
     tls:
-      secretName: otel-wildcard
+      passthrough: true
   tcpproxy:
     services:
     - name: otel-collector
       port: 55680
       # tls: HTTP/1 TLS
       # h2:  HTTP/2 TLS
       # h2c: HTTP/2 cleartext
-      protocol: h2c
+      protocol: h2
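
For switching back and forth between the two setups while testing, the same two-field change can be sketched as a function over the manifest loaded as a plain Python dict. This is only an illustration: `set_tls_mode` is a hypothetical helper, the dict is assumed to have the same shape as the HTTPProxy YAML above, and loading or applying the manifest is left out.

```python
import copy


def set_tls_mode(http_proxy: dict, mode: str, secret_name: str = "otel-wildcard") -> dict:
    """Return a copy of an HTTPProxy manifest switched between the two setups.

    mode="terminate":   Envoy terminates TLS and speaks h2c to the backend.
    mode="passthrough": Envoy passes TLS through and the backend speaks h2.
    """
    if mode not in ("terminate", "passthrough"):
        raise ValueError(f"unknown mode: {mode!r}")
    out = copy.deepcopy(http_proxy)  # leave the caller's manifest untouched
    spec = out["spec"]
    if mode == "terminate":
        spec["virtualhost"]["tls"] = {"secretName": secret_name}
        protocol = "h2c"
    else:
        spec["virtualhost"]["tls"] = {"passthrough": True}
        protocol = "h2"
    for service in spec["tcpproxy"]["services"]:
        service["protocol"] = protocol
    return out


# Same shape as the otel-collector HTTPProxy in this issue (metadata omitted).
PROXY = {
    "spec": {
        "virtualhost": {"fqdn": "otel.staging.test", "tls": {"secretName": "otel-wildcard"}},
        "tcpproxy": {"services": [{"name": "otel-collector", "port": 55680, "protocol": "h2c"}]},
    }
}

workaround = set_tls_mode(PROXY, "passthrough")
print(workaround["spec"]["virtualhost"]["tls"], workaround["spec"]["tcpproxy"]["services"][0]["protocol"])
```

Keeping both fields behind one switch avoids the half-changed state where `passthrough: true` is set but the service protocol is still `h2c`, or the reverse.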

Bug?

Specifically, I found that when routing OTLP (gRPC) traffic wrapped in HTTP/2 TLS:

  • the yages echo app works when terminating TLS at Envoy (h2c), but
  • otel-collector does not and needs a TLS passthrough (h2).

I think that means that there's something we could do here to make otel-collector's gRPC server play nicely with Envoy!
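
One way to test a backend in isolation, without Envoy in the path, is to check whether its port answers a cleartext HTTP/2 connection preface. The sketch below sends the client preface plus an empty SETTINGS frame and checks whether the first frame that comes back is a SETTINGS frame, which is what an HTTP/2 server is required to send first. The helper names are hypothetical, and it assumes the port is reachable directly (for example via `kubectl port-forward`) and is configured without TLS.

```python
import socket
import struct
from typing import Tuple

# HTTP/2 client connection preface (RFC 7540, section 3.5).
PREFACE = b"PRI * HTTP/2.0\r\n\r\nSM\r\n\r\n"
SETTINGS = 0x4  # frame type of SETTINGS


def frame_header(length: int, frame_type: int, flags: int = 0, stream_id: int = 0) -> bytes:
    """Build the 9-byte HTTP/2 frame header: 24-bit length, type, flags, 31-bit stream id."""
    return struct.pack(">I", length)[1:] + struct.pack(">BBI", frame_type, flags, stream_id & 0x7FFFFFFF)


def parse_frame_header(header: bytes) -> Tuple[int, int, int, int]:
    """Split a 9-byte frame header into (length, type, flags, stream_id)."""
    if len(header) != 9:
        raise ValueError("an HTTP/2 frame header is exactly 9 bytes")
    length = int.from_bytes(header[0:3], "big")
    stream_id = int.from_bytes(header[5:9], "big") & 0x7FFFFFFF
    return length, header[3], header[4], stream_id


def speaks_h2c(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if the server's first frame on a cleartext connection is SETTINGS.

    Connection errors and timeouts are not caught here and propagate to the caller.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(PREFACE + frame_header(0, SETTINGS))
        header = b""
        while len(header) < 9:
            chunk = sock.recv(9 - len(header))
            if not chunk:  # peer closed before sending a full frame header
                return False
            header += chunk
    return parse_frame_header(header)[1] == SETTINGS
```

For example, `speaks_h2c("localhost", 55680)` against a port-forwarded collector would be expected to return `False` while `tls_settings` is configured on the receiver, since the server is then waiting for a TLS handshake rather than an HTTP/2 preface.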

@kevincantu added the bug label Oct 7, 2020
@kevincantu
Author

kevincantu commented Oct 7, 2020

Thanks, by the way, to @pjanotti and @flands who helped me in the Gitter channel, and to this Contour ticket that pointed me at yages!

@kevincantu
Author

My spidey sense tells me this cmux issue may be related... 🤷‍♀️

@andrewcheelightstep

Hi folks. Just a quick check to see whether there is a timeline for this fix, since we are running into this as well.

@carlosalberto
Contributor

Hey @kevincantu

As I'm not a Contour expert, I tested against 'vanilla' Envoy and got it working:

  • Client using OTel Python 0.15 (patched to accept self-signed certificates, while doing full TLS verification).
  • Envoy 1.16 doing TLS termination
  • Collector receiving plain text.

I'm wondering if there's something Contour specific or I'm missing something. Let me know ;)

@kevincantu
Author

Oh that's encouraging: perhaps something in Envoy 1.16 fixes this? (The version of Contour I last tested with was using an earlier Envoy.)

@carlosalberto
Contributor

Hey @kevincantu Any update on this? ;)

@dy009

dy009 commented Aug 27, 2021

Any update on this? How can I disable TLS?

@kevincantu
Author

I'm no longer actively working on the same system which used this, so I haven't spun up a cluster to try any of this out again lately.

What I'd try, though, is setting up something like my example above with a newer version of Contour (and its corresponding newer version of Envoy), and seeing whether the workaround I showed is still necessary!

Specifically:

  • remove the TLS settings in the configmap for otel-collector, here (so the collector service isn't expecting TLS connections):

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-conf
...
data:
  otel-collector-config: |
    receivers:
      otlp:
        protocols:
          grpc:
            # remove me?
            tls_settings:
              cert_file: /tls/cert.pem
              key_file: /tls/key.pem
...

  • and configure Contour to terminate the TLS and forward unencrypted gRPC connections to the backend by altering this:

---
apiVersion: projectcontour.io/v1
kind: HTTPProxy
metadata:
  name: otel-collector
...
spec:
  ...
  tcpproxy:
    services:
    - name: otel-collector
      port: 55680
      # tls: HTTP/1 TLS
      # h2:  HTTP/2 TLS
      # h2c: HTTP/2 cleartext
      protocol: h2  # try making me "h2c"?

@amitgoyal02

How do I enable mTLS for the receiver?

@atoulme
Contributor

atoulme commented Jul 20, 2024

Closing as inactive, please reopen if this is still being worked on.

@atoulme closed this as not planned on Jul 20, 2024
@jmichalek132

I managed to run into this as well. It seems that, despite following https://projectcontour.io/docs/main/guides/grpc/, the Envoy instance is sending HTTP/1 requests to the otel collector instance.

@jmichalek132

I managed to run into this as well. It seems that, despite following https://projectcontour.io/docs/main/guides/grpc/, the Envoy instance is sending HTTP/1 requests to the otel collector instance.

I got it working when I switched from the Ingress object to the Contour-specific HTTPProxy object. I'll try to figure out whether there's a difference between the configurations they generate.
