TLS handshake error in opentelemetry-operator #1235

Closed
sunilkumar-nfer opened this issue Jun 26, 2024 · 10 comments

@sunilkumar-nfer

sunilkumar-nfer commented Jun 26, 2024

Hi all,

I am very new to OpenTelemetry. I deployed the operator using this link, but after deploying it I am seeing these logs in the operator pod: http: TLS handshake error from x.x.x.x:50516.

From the kube-apiserver side, however, I can see that the request is received and a connection to the operator pod is established.

kube-apiserver logs:

I0626 05:36:00.137545       1 client.go:354] "Received DIAL_REQ" serverID="ed60f6a8-2b5e-4f2b-bf8b-cad4738db" agentID="88bc09a7-e124-447c-b8aa-fabe5825" dialID=8045767835342242194 dialAddress="pod-ip:9443"
I0626 05:36:00.138933       1 client.go:429] "Endpoint connection established" dialID=804576783242194 connectionID=320 dialAddress="pod-ip:9443"

I have already tried cert-manager, auto-generated certificates, and passing our own certificates, but in every case we see the same issue.

We are using this values file for the operator:

opentelemetry-operator:
  fullnameOverride: optel-operator
  manager:
    image:
      repository: our-registery-url/nference/opentelemetry/opentelemetry-operator
      tag: "1.0.0"
    collectorImage:
      repository: "our-registery-url/opentelemetry/opentelemetry-collector-k8s"
      tag: 0.102.1
    autoInstrumentationImage:
      python:
        repository: "our-registery-url/opentelemetry/autoinstrumentation-python"
        tag: "1.0.0"
    
    resources:
      limits:
        cpu: 100m
        memory: 128Mi
        # ephemeral-storage: 50Mi
      requests:
        cpu: 100m
        memory: 64Mi
  
  kubeRBACProxy:
    enabled: true
    image:
      repository: our-repo-url/opentelemetry/kube-rbac-proxy
      tag: v0.15.0
    ports:
      proxyPort: 8443
    resources:
      limits:
        cpu: 500m
        memory: 128Mi
      requests:
        cpu: 50m
        memory: 64Mi
  
  admissionWebhooks:
    create: true
    servicePort: 443
    failurePolicy: Fail
    secretName: ""

    certManager:
      enabled: create
    
    autoGenerateCert:
      enabled: false
      recreate: false

Helm version: 3.14
Kubernetes version: 1.28
Go version: go1.21.9
kubectl: 0.26.11
Chart version: 0.62.0

I am not sure what I am doing wrong here. Can someone help, as we need to get tracing working with auto-instrumentation?

@jaronoff97
Contributor

Can you share the logs from the operator or the pods that you said indicated a failure? Can you also share the operator version for your custom image?

@sunilkumar-nfer
Author

@jaronoff97 opentelemetry-operator version: 0.102.0

Logs from the operator pod:

{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","message":"Starting the OpenTelemetry Operator","opentelemetry-operator":"0.102.0-17-g2a70ce7c","opentelemetry-collector":"repo-url/opentelemetry/opentelemetry-collector-k8s:0.102.1","opentelemetry-targetallocator":"ghcr.io/open-telemetry/opentelemetry-operator/target-allocator:0.102.0","operator-opamp-bridge":"ghcr.io/open-telemetry/opentelemetry-operator/operator-opamp-bridge:0.102.0","auto-instrumentation-java":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:1.32.1","auto-instrumentation-nodejs":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.51.0","auto-instrumentation-python":"repo.url/opentelemetry/autoinstrumentation-python:1.0.0","auto-instrumentation-dotnet":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-dotnet:1.2.0","auto-instrumentation-go":"ghcr.io/open-telemetry/opentelemetry-go-instrumentation/autoinstrumentation-go:v0.13.0-alpha","auto-instrumentation-apache-httpd":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","auto-instrumentation-nginx":"ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-apache-httpd:1.0.4","feature-gates":"-operator.golang.flags,operator.observability.prometheus","build-date":"2024-06-19T15:32:22Z","go-version":"go1.21.11","go-arch":"amd64","go-os":"linux","labels-filter":[],"annotations-filter":[],"enable-multi-instrumentation":false,"enable-apache-httpd-instrumentation":true,"enable-dotnet-instrumentation":true,"enable-go-instrumentation":false,"enable-python-instrumentation":true,"enable-nginx-instrumentation":false,"enable-nodejs-instrumentation":true,"enable-java-instrumentation":true,"zap-message-key":"message","zap-level-key":"level","zap-time-key":"timestamp","zap-level-format":"uppercase"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"setup","message":"the env var WATCH_NAMESPACE isn't set, watching all namespaces"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"setup","message":"Prometheus CRDs are installed, adding to scheme."}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"setup","message":"Openshift CRDs are not installed, skipping adding to scheme."}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1beta1-opentelemetrycollector"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/convert"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Conversion webhook enabled","GVK":"opentelemetry.io/v1beta1, Kind=OpenTelemetryCollector"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-instrumentation"}

{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=Instrumentation","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-instrumentation"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-v1-pod"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a mutating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/mutate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.builder","message":"Registering a validating webhook","GVK":"opentelemetry.io/v1alpha1, Kind=OpAMPBridge","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Registering webhook","path":"/validate-opentelemetry-io-v1alpha1-opampbridge"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"setup","message":"starting manager"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.metrics","message":"Starting metrics server"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","message":"starting server","kind":"health probe","addr":"[::]:8081"}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.metrics","message":"Serving metrics server","bindAddress":"0.0.0.0:8080","secure":false}
{"level":"INFO","timestamp":"2024-06-28T05:32:14Z","logger":"controller-runtime.webhook","message":"Starting webhook server"}
I0628 05:32:14.975489       1 leaderelection.go:250] attempting to acquire leader lease opentelemetry-operator/9f7554c3.opentelemetry.io...
{"level":"INFO","timestamp":"2024-06-28T05:32:15Z","logger":"controller-runtime.certwatcher","message":"Updated current TLS certificate"}
{"level":"INFO","timestamp":"2024-06-28T05:32:15Z","logger":"controller-runtime.webhook","message":"Serving webhook server","host":"","port":9443}
{"level":"INFO","timestamp":"2024-06-28T05:32:15Z","logger":"controller-runtime.certwatcher","message":"Starting certificate watcher"}
2024/06/27 14:26:08 http: TLS handshake error from api-server-ip:34416: EOF
2024/06/27 21:36:39 http: TLS handshake error from api-server-ip:56584: EOF
2024/06/27 21:36:39 http: TLS handshake error from api-server-ip:56134: EOF

Logs from the cert-manager pod:

I0628 00:05:54.729617       1 controller.go:220] "Starting workers" logger="cert-manager" controller="customresourcedefinition" controllerGroup="apiextensions.k8s.io" controllerKind="CustomResourceDefinition" worker count=1
I0628 00:05:54.736932       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="validatingwebhookconfiguration" kind="validatingwebhookconfiguration" name="cert-manager-staging-cluster-webhook"
I0628 00:05:54.737483       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="mutatingwebhookconfiguration" kind="mutatingwebhookconfiguration" name="cert-manager-staging-cluster-webhook"
I0628 00:05:54.745247       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="validatingwebhookconfiguration" kind="validatingwebhookconfiguration" name="optel-operator-validation"
I0628 00:05:54.745442       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="mutatingwebhookconfiguration" kind="mutatingwebhookconfiguration" name="optel-operator-mutation"
I0628 00:05:54.751039       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="mutatingwebhookconfiguration" kind="mutatingwebhookconfiguration" name="cert-manager-staging-cluster-webhook"
I0628 00:05:55.222395       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="opentelemetrycollectors.opentelemetry.io"
I0628 00:05:55.291401       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="opampbridges.opentelemetry.io"
I0628 00:05:55.694503       1 reconciler.go:142] "Updated object" logger="cert-manager" kind="customresourcedefinition" kind="customresourcedefinition" name="opentelemetrycollectors.opentelemetry.io"

Just as an update: this is a GCP setup, and we have already whitelisted the firewall rule that allows the master nodes to access port 9443/tcp on the worker nodes.

@sunilkumar-nfer
Author

@jaronoff97 any update here?

@jaronoff97
Contributor

I was away for the weekend. Unfortunately, this is related to a known issue with Go + Kubernetes. You can read more about this issue here. Please comment on this issue if you have the time. I'm going to close this in favor of the operator's tracking issue.

The failures should be intermittent and non-permanent, which is visible in the timestamps of the EOF logs. If this is a permanent TLS failure, please let me know and I will reopen this issue.

@sunilkumar-nfer
Author

sunilkumar-nfer commented Jul 2, 2024

@jaronoff97 This is a permanent issue for us. I need one confirmation: after deploying the operator, we deployed the collector using the Helm chart below, then added a Python Instrumentation resource and the inject annotation on the python-app pod, but we have not seen any entries in the collector or in Jaeger (which stores the data from the collector).

Below is the Helm configuration:

opentelemetry-collector:
  fullnameOverride: optel-collector
  mode: deployment
  config:
    exporters:
      
      otlp:
        endpoint: http://jeager-ip:4327
        tls:
          insecure: true
    
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      memory_limiter:
        check_interval: 1s
        limit_percentage: 75
        spike_limit_percentage: 15
      batch:
        send_batch_size: 10000
        timeout: 10s

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
        
  nodeSelector:
    "nodepool-type": "sre"
  tolerations:
  - key: "type"
    operator: "Equal"
    value: "sre"
    effect: "NoSchedule"


  image:
  # If you want to use the core image `otel/opentelemetry-collector`, you also need to change `command.name` value to `otelcol`.
    repository: "repo-url/opentelemetry/opentelemetry-collector-k8s"
    pullPolicy: IfNotPresent
    # Overrides the image tag whose default is the chart appVersion.
    tag: "0.102.1"
  ingress:
    enabled: true
    annotations: {}
    ingressClassName: nginx
    hosts:
     - host: optel-collector.avcd.local
       paths:
         - path: /
           pathType: Prefix
           port: 4318
    tls:
     - hosts:
         - optel-collector.avcd.local
Instrumentation resource and application manifests:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: python-instrumentation
spec:
  exporter:
    endpoint: http://optel-collector.opentelemetry-operator.svc.cluster.local:4317
  propagators:
    - tracecontext
    - baggage
    - b3
  sampler:
    type: parentbased_traceidratio
    argument: "1"
  python:
    env:
      - name: OTEL_LOGS_EXPORTER
        value: otlp_proto_http
      - name: OTEL_PYTHON_LOGGING_AUTO_INSTRUMENTATION_ENABLED
        value: 'true'
      - name: OTEL_EXPORTER_OTLP_ENDPOINT
        value: http://optel-collector.opentelemetry-operator.svc.cluster.local:4318

---
apiVersion: v1
kind: Service
metadata:
  name: hello-world
  namespace: opentelemetry-operator
spec:
  selector:
    app: hello-world
  ports:
    - port: 80
      targetPort: http-api
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-world
  namespace: opentelemetry-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hello-world
  strategy:
    type: RollingUpdate
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-python: "true"
      labels:
        app: hello-world
    spec:
      containers:
      - name: hello-world
        image: repo-url/opentelemetry/hello-world:1.0.0
        imagePullPolicy: Always
        resources:
          requests:
            memory: 0.1G
            cpu: 0.1
          limits:
            memory: 0.2G
            cpu: 0.2
        ports:
        - name: http-api
          containerPort: 8080

I am not sure what I did wrong here. I need one confirmation from your side: the TLS handshake errors appear in the operator pod, the collector and Instrumentation resources are created with no errors, and on the application pod side we can see that the init containers start correctly, so I am assuming the operator is working fine, but we still do not receive any traces in Jaeger. Can you please suggest the next steps we can take here, or are we going in the wrong direction?

opentelemetry-collector-version: 0.95.0

@sunilkumar-nfer
Author

sunilkumar-nfer commented Jul 2, 2024

Screenshot from the python-app events (Screenshot 2024-07-02 at 12 20 52 PM)

NOTE: We have tested that if we deploy the collector on a VM and install the SDK and configuration in the python-app image, we successfully receive traces in Jaeger (screenshot attached below). We are not sure what is wrong at the Kubernetes level.

Screenshot 2024-07-02 at 12 28 05 PM (traces received in Jaeger)

@sunilkumar-nfer
Author

@jaronoff97 Just an update: I think I found the issue with the Python auto-instrumentation. When we enable debug mode on the Python app, the instrumentation is not able to trace the requests; after removing debug mode, we can see the traces.

Is this the normal behaviour?

@jaronoff97
Contributor

I'm not positive that's the normal behavior... but if you were using the debug exporter, that would be expected. I would also verify that your destination endpoint is correct: endpoint: http://jeager-ip:4327 is misspelled, I believe.

@sunilkumar-nfer
Author

@jaronoff97
I need your confirmation/help regarding Kubernetes resources. When dealing with Deployment resources, the service name corresponds to the deployment name. However, for non-Deployment resources such as rollouts or replica sets, the service name includes a UUID along with the resource name.

Is there a way to achieve consistent service naming for rollout resources, similar to how it is done for Deployment resources? I am aware that adding the service name as an environment variable in each application's Helm file is a solution, but I am looking for an alternative approach that avoids this.

Could you please help with this?

@jaronoff97
Contributor

It sounds like this question is different from the topic of this issue and is probably better asked in the CNCF Slack, in the otel-helm-charts channel. Could you re-ask the question there, and we can continue the discussion there?
