Ideas for debugging a timeout? #675

Open
max-sixty opened this issue May 31, 2020 · 6 comments
@max-sixty
Contributor

I have an issue that is probably not the fault of sops, but that I can only replicate with sops. If anyone has hit something similar, or has thoughts on how I could debug this further, I'd appreciate any insights.

I'm using sops on a GKE cluster set up with Workload Identity. The following command fails:

kubectl run test-pod -it \
    --image gcr.io/[...] \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c 'sops -d [...]'

...with...

If you don't see a command prompt, try pressing enter.
Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  projects/[...]/locations/us-west2/keyRings/[...]: FAILED
    - | Error decrypting key: Post
      | https://cloudkms.googleapis.com/v1/projects/[...]:decrypt?alt=json&prettyPrint=false:
      | Get
      | http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform:
      | net/http: timeout awaiting response headers

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

But the following command succeeds:

kubectl run test-pod -it \
    --image gcr.io/[...] \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c 'sleep 5 && sops -d [...]' # <- note the `sleep 5 &&`

...so adding a `sleep 5` before the sops call remedies the failure. I've replicated this a few times to confirm the sleep is what makes the difference. I can't replicate the failure with any calls to gcloud kms encrypt; I only get this behavior with sops.

Potentially sops is "too fast" and issues a request before GKE has had the chance to set up permissions for the pod?

Besides adding `sleep 5` to every invocation, is there any way to configure a longer timeout in sops? Or any other ideas for debugging?

Thank you!

@autrilla
Contributor

Looks like that's the case, yeah. From https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#limitations:

The GKE Metadata Server takes a few seconds to start to run on a newly created pod. Therefore, attempts to authenticate or authorize using Workload Identity made within the first few seconds of a pod's life may fail. Retrying the call will resolve the problem.

I'm guessing the gcloud CLI has some retry logic? Maybe the Go SDK gained similar retry logic in a recent update, in which case we might be able to fix this by building against the latest SDK version.
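
In the meantime, a retry loop around the sops invocation should work around this from the caller's side. A minimal sketch (the file path, retry count, and backoff are placeholders, not anything sops itself provides):

# Hypothetical workaround: retry sops a few times while the GKE
# metadata server comes up. secrets.enc.yaml stands in for the real file.
for attempt in 1 2 3 4 5; do
  if sops -d secrets.enc.yaml; then
    break
  fi
  echo "sops attempt ${attempt} failed; retrying in 2s..." >&2
  sleep 2
done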

@max-sixty
Contributor Author

Ah, great find, thanks @autrilla

I'm not sure how sops calls the SDK; I'm using the latest gcloud version on my end, but to the extent sops calls the SDK directly, your suggestion sounds great.

Feel free to close the issue, or repurpose it for the SDK update.

@autrilla
Contributor

I'm not sure it would actually fix it. Would it be possible for you to build SOPS locally, but with the SDK dependency at https://github.com/mozilla/sops/blob/master/go.mod#L6 bumped to v0.57.0, which appears to be the latest version, and then try and see if that fixes it? I kind of doubt it, since there's no mention of anything like that in the changelog.

IMO this is something the SDK should handle, but if they're unwilling to, we could retry on the SOPS side when we hit that error.

@max-sixty
Contributor Author

Yes, on master or develop?

@autrilla
Contributor

develop ideally

@max-sixty
Contributor Author

Unfortunately I get the same result:

kubectl run test-pod2 -it \
    --image gcr.io/[...]:sops-test \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c '/go/bin/sops -d [...]'
If you don't see a command prompt, try pressing enter.
Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  projects/[...]: FAILED
    - | Error decrypting key: Post
      | https://cloudkms.googleapis.com/v1/projects/[...]?alt=json&prettyPrint=false:
      | Get
      | http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform:
      | net/http: timeout awaiting response headers

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

If there's a command I can run to verify that pod has the updated SDK version, let me know. I haven't used Go before, though everything went fairly smoothly. Here's the branch I'm using: develop...max-sixty:update-gcloud-sdk

I agree this seems like a problem gcloud should solve rather than sops. For the moment, I'll add a sleep and a cache in my application code.
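
Concretely, something like this before the first sops call, polling the metadata endpoint from the error output until it responds (a sketch; the 30-second cap and file path are placeholders):

# Wait for the GKE metadata server to come up before invoking sops.
# The token URL and Metadata-Flavor header are the standard GCE
# metadata-server interface; curl -f fails until it returns a 2xx.
for i in $(seq 1 30); do
  if curl -sf -H 'Metadata-Flavor: Google' \
      'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token' \
      >/dev/null; then
    break
  fi
  sleep 1
done
sops -d secrets.enc.yaml   # placeholder file path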
