
TLS handshake error with multi-cluster setup #2402

Closed
freegroup opened this issue Dec 17, 2021 · 5 comments
Labels
kind/bug These are bugs.
Milestone

Comments

@freegroup
Contributor

I have a multi-cluster setup with two clusters A and B and the related GameServerAllocationPolicy resources.
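
The policy in cluster A that points to cluster B looks roughly like this (a trimmed-down sketch; the endpoint, cluster name, secret name and CA below are placeholders for my real values):

apiVersion: multicluster.agones.dev/v1
kind: GameServerAllocationPolicy
metadata:
  name: allocate-from-cluster-b
  namespace: default
spec:
  priority: 2
  weight: 100
  connectionInfo:
    allocationEndpoints:
    - allocator.cluster-b.example.com     # placeholder for the real allocator endpoint of cluster B
    clusterName: cluster-b
    namespace: default
    secretName: allocation-client-secret  # placeholder: secret holding the client cert/key for the remote allocator
    serverCa: LS0tLS1CRUdJTi...           # placeholder: base64-encoded CA of the remote allocator's serving cert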

Everything seems to work fine. I can use the allocator endpoints of cluster-A or cluster-B, and the GameServerAllocationPolicy resources work as expected. On the surface everything looks great - I'm very satisfied.

But the log of the agones-allocator contains many, many TLS handshake errors. I'm not sure whether they come from the communication between cluster-a and cluster-b, or whether they are related to some internal sync call?

kubectl logs -n agones-system deployments/agones-allocator -f

2021/12/17 16:01:08 http: TLS handshake error from 100.96.0.1:8819: EOF
2021/12/17 16:01:12 http: TLS handshake error from 100.96.6.1:39636: EOF
2021/12/17 16:01:12 http: TLS handshake error from 100.96.6.1:36797: EOF
2021/12/17 16:01:16 http: TLS handshake error from 10.250.0.148:22066: EOF
.
.
.
.
.
2021/12/17 16:01:46 http: TLS handshake error from 100.96.7.1:52478: EOF
2021/12/17 16:01:48 http: TLS handshake error from 100.96.0.1:46168: EOF
2021/12/17 16:01:52 http: TLS handshake error from 100.96.6.1:59080: EOF
2021/12/17 16:01:56 http: TLS handshake error from 100.96.7.1:32376: EOF
2021/12/17 16:01:57 http: TLS handshake error from 100.96.5.1:51616: EOF
2021/12/17 16:01:58 http: TLS handshake error from 100.96.0.1:15239: EOF
2021/12/17 16:02:06 http: TLS handshake error from 10.250.0.148:18141: EOF
@freegroup freegroup added the kind/bug These are bugs. label Dec 17, 2021
@freegroup freegroup reopened this Dec 20, 2021
@freegroup
Contributor Author

I know this is a very long description, but unfortunately I am a bit desperate. If I can't get rid of these messages, there is no point in driving my PoC any further. I am totally excited about the setup - but these TLS errors are driving me crazy...

I still have the problem that there are a lot of TLS handshake errors in my agones-allocator log. The multi-cluster setup itself seems to work: when I call the AllocationService, I get my GameServer from the clusters in the order defined in the allocation policies
(first from cluster01, then cluster02, ... then clusterN) - perfect.

My Setup

1.) Create the client cert for later use

# create root CA
openssl genrsa -out myCA.key 2048
openssl req -x509 -new -nodes -key myCA.key -sha256 -days 1825 -out myCA.pem
# Congratulations, you’re now a CA. Sort of.

# create client-certificate
openssl genrsa -out client_cert.key 2048
openssl req -new -key client_cert.key -out client_cert.csr -config ../certificate/client-certificate.conf
openssl x509 -req -in client_cert.csr -CA myCA.pem -CAkey myCA.key -out client_cert.crt -extensions v3_ca -extfile ../certificate/extfile -sha256 -CAcreateserial

# validate that everything is in place
openssl x509 -noout -text -in client_cert.crt
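
The two OpenSSL config files referenced above are not shown; for completeness, this is roughly what I use (a minimal sketch - the subject and extensions are just my test values):

# ../certificate/client-certificate.conf (sketch)
[ req ]
default_bits       = 2048
prompt             = no
distinguished_name = dn

[ dn ]
CN = allocation-client

# ../certificate/extfile (sketch) - provides the v3_ca section used with "-extensions v3_ca"
[ v3_ca ]
keyUsage         = digitalSignature, keyEncipherment
extendedKeyUsage = clientAuth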

2.) Deploy allocator-client-ca in agones-system:

# deploy the generated rootCA in all agones cluster
#
kubectl create secret generic allocator-client-ca \
    -n agones-system \
    --from-file=ca.crt=./secrets/myCA.pem

3.) Deploy my-client-cert in gameserver namespace (For test purpose I use only one client-cert for all agones cluster):

# deploy my-client-cert in all agones clusters
#
kubectl create secret generic my-client-cert \
    --from-file=tls.crt=./secrets/client_cert.crt \
    --from-file=tls.key=./secrets/client_cert.key

4.) Add the my-client-cert to the allow-list (allocator-client-ca)

# add them to the allow-list in all agones cluster
#
CERT_FILE_VALUE=$(cat ./secrets/client_cert.crt | base64)
kubectl get secret allocator-client-ca -o json -n agones-system | jq '.data["my-client-cert.crt"]="'${CERT_FILE_VALUE}'"' | kubectl apply -f -
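
To double-check that the cert really ended up in the allow-list secret (note: on Linux I would use "base64 -w 0" above so the encoded value stays on a single line), something like this works:

# list the keys stored in the allow-list secret
kubectl get secret allocator-client-ca -n agones-system -o json | jq -r '.data | keys'

# decode the new entry and compare it with the local file
kubectl get secret allocator-client-ca -n agones-system \
    -o jsonpath='{.data.my-client-cert\.crt}' | base64 -d | diff - ./secrets/client_cert.crt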

5.) Create certificate request for the TLS communication with my-domain.com

(TLS is working)

apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: allocator-tls
  namespace: agones-system
spec:
  commonName: my-domain.com
  dnsNames:
  - cluster01.my-domain.com
  - allocator.cluster01.my-domain.com
  secretRef:
    name: allocator-tls
---
apiVersion: cert.gardener.cloud/v1alpha1
kind: Certificate
metadata:
  name: allocator-tls-ca
  namespace: agones-system
spec:
  commonName: my-domain.com
  dnsNames:
  - cluster01.my-domain.com
  secretRef:
    name: allocator-tls-ca
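
Before installing Agones I wait until Gardener's cert-management has issued the certificates and written them into the referenced secrets (a quick sanity check, assuming the cert controller is active in the cluster):

# both Certificate resources should become ready
kubectl get certificates.cert.gardener.cloud -n agones-system

# and the referenced secrets should exist
kubectl get secret allocator-tls allocator-tls-ca -n agones-system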

6.) Install Agones as follows:

helm repo add agones https://agones.dev/chart/stable
helm repo update
helm install agones \
    --namespace agones-system \
    --create-namespace agones/agones \
    --set agones.featureGates=PlayerTracking=true \
    --set agones.ping.install=false \
    --set agones.controller.healthCheck.initialDelaySeconds=20 \
    --set agones.allocator.generateTLS=false \
    --set agones.allocator.generateClientTLS=false \
    --set agones.allocator.disableSecretCreation=true \
    --set agones.allocator.service.name=agones-allocator \
    --set agones.allocator.service.serviceType=LoadBalancer

Now I can call the allocation-service endpoint and get the expected result... but the log is very noisy with TLS handshake errors.
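
For reference, this is roughly how I call the REST allocation endpoint with the client certificate from step 1 (a sketch; the CA file path and the namespace are placeholders for my real values):

# allocate a GameServer via the allocator's REST endpoint using mTLS
curl --silent --show-error \
    --key    ./secrets/client_cert.key \
    --cert   ./secrets/client_cert.crt \
    --cacert ./secrets/allocator-tls-ca.pem \
    -H "Content-Type: application/json" \
    --data '{"namespace":"default"}' \
    https://allocator.cluster01.my-domain.com/gameserverallocation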

@cindy52
Contributor

cindy52 commented Dec 20, 2021

It might be related to your ELB setup if you are using AWS. When the protocol setup is not correct, the health check will throw TLS handshake errors continually. You may refer to this and change the protocol to SSL instead of TLS for the ELB.

@freegroup
Contributor Author

freegroup commented Dec 21, 2021

Thanks for your response.
I'm using a https://gardener.cloud setup running on AWS resources... but it seems that the agones-allocator is being contacted by my control plane components in a kind of broadcast. All the related IPs belong to pods in the kube-system namespace (calico, vpn, kube-proxy...) - every single IP in the log is part of the control plane.

Is the allocator broadcasting any messages, or is something trying to scrape its health endpoints? I'm a little bit lost.

@freegroup
Contributor Author

freegroup commented Dec 21, 2021

@cindy52

The annotation service.beta.kubernetes.io/aws-load-balancer-backend-protocol: ssl on the allocator service removes the EOF messages.
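
For anyone else reading along, I set it like this:

kubectl annotate --overwrite service agones-allocator -n agones-system \
    'service.beta.kubernetes.io/aws-load-balancer-backend-protocol'='ssl'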

Now I have a different error :-/

2021/12/21 09:26:30 http: TLS handshake error from 10.250.18.160:19015: tls: client didn't provide a certificate
2021/12/21 09:26:30 http: TLS handshake error from 10.250.18.160:8295: tls: client didn't provide a certificate
2021/12/21 09:26:40 http: TLS handshake error from 10.250.18.160:51338: tls: client didn't provide a certificate
2021/12/21 09:26:40 http: TLS handshake error from 100.96.2.1:19846: tls: client didn't provide a certificate

It seems that the agones-allocator does not provide a non-mTLS health check endpoint - or does it?
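
For debugging I checked whether the allocator pod exposes a plain HTTP endpoint at all (assuming 8080 is its health port, which is what I see in the pod spec):

# port-forward the allocator's health port and probe it without TLS
kubectl port-forward -n agones-system deployments/agones-allocator 8080:8080 &
curl -i http://localhost:8080/live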

@freegroup
Contributor Author

freegroup commented Dec 21, 2021

@cindy52 thanks for pointing me in the right direction +1

In the end the AWS load balancer needs an endpoint for its health check. The AWS default is to pick the first port in the ports section of the service declaration - which is, in our case, a gRPC port with mTLS... which causes all the noise.

The solution is to patch the agones-allocator service:

  • add a new http port as the first entry in the ports section
  • define the path and protocol the health check should use via annotations (see the annotations below)

apiVersion: v1
kind: Service
metadata:
  annotations:
    meta.helm.sh/release-name: agones
    meta.helm.sh/release-namespace: agones-system
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: "10"
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /live
    service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http
  labels:
    app: agones
    app.kubernetes.io/managed-by: Helm
    chart: agones-1.19.0
    component: allocator
    heritage: Helm
    release: agones
  name: agones-allocator
  namespace: agones-system
spec:
  ports:
  - name: http
    port: 8080
    protocol: TCP
    targetPort: 8080
  - name: https
    port: 443
    protocol: TCP
    targetPort: 8443
  selector:
    multicluster.agones.dev/role: allocator
  sessionAffinity: None
  type: LoadBalancer

or

kubectl annotate --overwrite service agones-allocator -n agones-system  'service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval'='10'
kubectl annotate --overwrite service agones-allocator -n agones-system  'service.beta.kubernetes.io/aws-load-balancer-healthcheck-path'='/live'
kubectl annotate --overwrite service agones-allocator -n agones-system  'service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol'='http'


kubectl patch service \
        agones-allocator \
        -n agones-system \
        --type merge \
        --patch \
       '{"spec": {"ports":[{"name":"http", "port":8080, "targetPort":8080},{"name":"https", "port":443, "targetPort":8443}]}}'

@SaitejaTamma SaitejaTamma added this to the 1.20.0 milestone Jan 4, 2022