Timing issue causing failures in Pods created by Jobs #136

Open
damomurf opened this issue May 14, 2018 · 12 comments

As per #46, I'm seeing similar issues where credential requests frequently fail with a 404 from kube2iam soon after a Job's pods are created. I'm able to reliably reproduce the issue with a fairly simple Job definition in a couple of different clusters running Kubernetes v0.8.7.

I've tried a number of different kube2iam versions (v0.8.2, v0.8.4, v0.10.0) and all exhibit similar behaviour.

Adding a delay at the beginning of the job's execution can help, but at times even a 5-second sleep appears not to be enough to get around the issue:

(Real IAM role arn replaced to protect the innocent)

Job definition:

apiVersion: batch/v1
kind: Job
metadata:
  name: iam-role-test
spec:
  completions: 50
  parallelism: 5
  backoffLimit: 2
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
    spec:
      restartPolicy: Never
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c 
        - "curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"

An individual failed Pod's log output appears as follows (curl -v is messing with the formatting, but you get the idea):


*   Trying 169.254.169.254...
* TCP_NODELAY set
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to 169.254.169.254 (169.254.169.254) port 80 (#0)
> GET /latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName HTTP/1.1
> Host: 169.254.169.254
> User-Agent: curl/7.55.0
> Accept: application/json
> 
* The requested URL returned error: 404 Not Found
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
* Closing connection 0
curl: (22) The requested URL returned error: 404 Not Found

Some of the 50 job executions fail (the success/failure split seems fairly random), with the curl command simply receiving a 404 Not Found as described above. Adding a sleep 5 prior to the curl command fixes the issue most of the time.
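
For reference, a minimal sketch of that workaround: the same container command from the Job definition above, with the sleep added before the curl call (the 5-second value is the one mentioned and still isn't always sufficient):

        command:
        - sh
        - -c
        # Give kube2iam time to learn about the new pod before requesting credentials.
        - "sleep 5 && curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName"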

I've enabled the --log-level=debug and --debug options; the relevant log entries mentioning the role in question are listed in this gist: https://gist.github.com/damomurf/30468bfc1bd595720cb3c9e44946bc19

Hopefully this provides sufficient detail on the issue, as requested in #46.


damomurf commented May 14, 2018

One other fact that may be helpful: the equivalent deployment (with a while loop to keep invoking the curl command and keep the pods running) does not exhibit the same behaviour:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: iam-role-test-deploy
spec:
  replicas: 50
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
      labels:
        app: iam-role-test
    spec:
      containers:
      - name: test
        image: governmentpaas/curl-ssl
        command:
        - /bin/sh
        - -c
        - "while [ true ]; do curl -v -f -H 'Accept: application/json' http://169.254.169.254/latest/meta-data/iam/security-credentials/iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName; sleep 60; done"


pms1969 commented May 31, 2018

We are seeing very similar behaviour in a CronJob. Time to delve deep I guess.


roffe commented May 31, 2018

Observing the same problem here, but with pods in Deployments: in every deployment that begins by downloading something from S3 via aws-cli, the first start always fails; after the first restart the pods come up fine.

Common to all of them is a 404 in the kube2iam logs and the error:

fatal error: Unable to locate credentials

Kube2iam: 0.10.0
Kubernetes 1.9.6
Kops 1.9.0


roffe commented May 31, 2018

I've posted more info in #122


blimmer commented Jun 4, 2018

I can reproduce this issue 99% of the time with a workload that generates a lot of pods (20-ish) all at once with kube2iam annotations. If I make the pod sleep for 30 seconds before trying to use my IAM role, it seems to work around the problem.

When I only spin up a few pods at a time, I don't see the problem.


szuecs commented Jun 7, 2018

We also see similar problems for all applications that, for example, want to read from S3 at startup.
Our workaround is to use an initContainer that tries to access the resource (a sketch follows below).
The nature of a distributed system like Kubernetes is to have race conditions all over the place.
The question is how to tackle them in general. Prefetching the credentials from AWS would be one way; I am not sure what kube2iam is doing.
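
A minimal sketch of that initContainer approach, reusing the role annotation and curl image from the examples above; this is a pod template fragment, and the retry interval and application container are placeholders:

spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: arn:aws:iam::accountId:role/roleName
    spec:
      initContainers:
      # Block pod startup until kube2iam can serve credentials for the annotated role.
      - name: wait-for-credentials
        image: governmentpaas/curl-ssl
        command:
        - sh
        - -c
        - "until curl -sf http://169.254.169.254/latest/meta-data/iam/security-credentials/arn:aws:iam::accountId:role/roleName > /dev/null; do sleep 2; done"
      containers:
      # Hypothetical application container; replace with the real workload.
      - name: app
        image: my-app-image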


blimmer commented Jun 7, 2018

We found this article, which does a good job of explaining the problem and talks about the kiam workaround:

https://medium.com/@pingles/kiam-iterating-for-security-and-reliability-5e793ab93ec3

#132 suggests to our org that maybe we should look for alternatives. We plan on trying out kiam somewhat soon to see if it helps with this problem.


roffe commented Jun 7, 2018

I've found that postStart hooks in k8s make this problem worse with kube2iam, since the pod won't go into a Running state until the postStart hook has executed.

@mikkeloscar (Contributor)

FYI: a discussion has been started in sig-aws about finding a common solution to this problem. See https://docs.google.com/document/d/1rn-v2TNH9k4Oz-VuaueP77ANE5p-5Ua89obK2JaArfg/edit?disco=AAAAB6DI_qM&ts=5b19085a for comparisons between existing projects.


timm088 commented Oct 23, 2018

For clarity, @mikkeloscar is referring to a common solution for implementing and supporting IAM in Kubernetes on AWS (EKS or self-run), not specifically this issue around Jobs and their race condition.

@mikkeloscar (Contributor)

For clarity, @mikkeloscar is referring to a common solution for implementing and supporting IAM in Kubernetes on AWS (EKS or self-run), not specifically this issue around Jobs and their race condition.

Yup. The point of my comment back then was that it's very hard to fix these sorts of race conditions with the architecture of kube2iam. The discussion was started to approach the problem differently.

I've been working on a Proof of Concept (https://github.com/mikkeloscar/kube-aws-iam-controller) to eliminate all of these kinds of race conditions. Since the AWS SDKs handle credentials differently, it currently doesn't work for all of them. It works for Python and Java for now, and I'm working with AWS to add support for Go as well (aws/aws-sdk-go#1993).


WxFang commented Oct 8, 2019

Any update to this issue?

Jesse0Michael added a commit to leakytap/aws-cli-kube that referenced this issue Nov 22, 2022
Create an image from amazon/aws-cli with a run script that ensures the AWS credentials are available before executing the command.

script taken from:
jtblin/kube2iam#122 (comment)

related issue:
jtblin/kube2iam#136
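
The referenced commit describes wrapping the container command in a run script that waits for AWS credentials first. A rough sketch of that idea follows; this is not the actual script from the linked comment, and the AWS_ROLE variable name and retry limit are assumptions:

#!/bin/sh
# Wait for the instance-metadata endpoint (proxied by kube2iam) to serve
# credentials before running the wrapped command. AWS_ROLE and the 30-attempt
# limit are assumed values, not taken from the referenced script.
attempts=0
until curl -sf "http://169.254.169.254/latest/meta-data/iam/security-credentials/${AWS_ROLE}" > /dev/null; do
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 30 ]; then
    echo "timed out waiting for AWS credentials" >&2
    exit 1
  fi
  sleep 2
done
# Hand off to whatever command the container was asked to run.
exec "$@"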