Ideas for debugging a timeout? #675

Open
max-sixty opened this issue May 31, 2020 · 6 comments
@max-sixty
Contributor

I have an issue that is probably not the fault of sops, but that I can only replicate with sops. If anyone has hit something similar, or has thoughts on how I could debug this further, I'd appreciate any insights.

I'm using sops on a GKE cluster set up with Workload Identity. The following command fails:

kubectl run test-pod -it \
    --image gcr.io/[...] \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c 'sops -d [...]'

...with...

If you don't see a command prompt, try pressing enter.
Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  projects/[...]/locations/us-west2/keyRings/[...]: FAILED
    - | Error decrypting key: Post
      | https://cloudkms.googleapis.com/v1/projects/[...]:decrypt?alt=json&prettyPrint=false:
      | Get
      | http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform:
      | net/http: timeout awaiting response headers

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

But the following command succeeds:

kubectl run test-pod -it \
    --image gcr.io/[...] \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c 'sleep 5 && sops -d [...]' # <- note the `sleep 5 &&`

...so adding a `sleep 5` before the sops call remedies the failure. I've replicated this a few times to confirm the sleep is what makes the difference. I can't replicate the failure with any calls to gcloud kms encrypt; I only get this behavior with sops.

Potentially sops is "too fast" and issues a request before GKE has had the chance to set up permissions for the pod?

Besides adding `sleep 5` to every invocation, is there any way to configure a longer timeout in sops? Or any other ideas for debugging?

Thank you!

@autrilla
Contributor

Looks like that's the case, yeah. From https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#limitations:

The GKE Metadata Server takes a few seconds to start to run on a newly created pod. Therefore, attempts to authenticate or authorize using Workload Identity made within the first few seconds of a pod's life may fail. Retrying the call will resolve the problem.

I'm guessing the gcloud CLI has some retry logic? Maybe the Go SDK gained similar retry logic in a recent update, in which case we might be able to fix this by building against the latest SDK version.
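
In the meantime, a retry loop around the sops invocation should work around this from the caller's side. A minimal sketch (the file path, retry count, and backoff are placeholders, not anything sops itself provides):

# Hypothetical workaround: retry sops a few times while the GKE
# metadata server comes up. secrets.enc.yaml stands in for the real file.
for attempt in 1 2 3 4 5; do
  if sops -d secrets.enc.yaml; then
    break
  fi
  echo "sops attempt ${attempt} failed; retrying in 2s..." >&2
  sleep 2
done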

@max-sixty
Contributor Author

Ah, great find, thanks @autrilla

I'm not sure how sops calls the SDK; I'm using the latest gcloud version on my end, but to the extent sops calls the SDK directly, your suggestion sounds great.

Feel free to close the issue, or repurpose it for the SDK update.

@autrilla
Contributor

I'm not sure it would actually fix it. Would it be possible for you to build SOPS locally, but with the SDK dependency at https://github.com/mozilla/sops/blob/master/go.mod#L6 bumped to v0.57.0, which appears to be the latest version, and then try and see if that fixes it? I kind of doubt it, since there's no mention of anything like that in the changelog.

IMO this is something the SDK should handle, but if they're unwilling to, we could retry on the SOPS side when we hit that error.

@max-sixty
Contributor Author

Yes, on master or develop?

@autrilla
Contributor

develop ideally

@max-sixty
Contributor Author

Unfortunately I get the same result:

kubectl run test-pod2 -it \
    --image gcr.io/[...]:sops-test \
    --serviceaccount argo-service-account \
    --namespace default \
    --rm \
    -- bash -c '/go/bin/sops -d [...]'
If you don't see a command prompt, try pressing enter.
Failed to get the data key required to decrypt the SOPS file.

Group 0: FAILED
  projects/[...]: FAILED
    - | Error decrypting key: Post
      | https://cloudkms.googleapis.com/v1/projects/[...]?alt=json&prettyPrint=false:
      | Get
      | http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token?scopes=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fcloud-platform:
      | net/http: timeout awaiting response headers

Recovery failed because no master key was able to decrypt the file. In
order for SOPS to recover the file, at least one key has to be successful,
but none were.

If there's a command I can run to verify that pod has the updated SDK version, let me know. I haven't used Go before, though everything went fairly smoothly. Here's the branch I'm using: develop...max-sixty:update-gcloud-sdk

I agree this seems like a problem gcloud should solve rather than sops. For the moment, I'll add a sleep and a cache in my application code.
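
Concretely, something like this before the first sops call, polling the metadata endpoint from the error output until it responds (a sketch; the 30-second cap and file path are placeholders):

# Wait for the GKE metadata server to come up before invoking sops.
# The token URL and Metadata-Flavor header are the standard GCE
# metadata-server interface; curl -f fails until it returns a 2xx.
for i in $(seq 1 30); do
  if curl -sf -H 'Metadata-Flavor: Google' \
      'http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token' \
      >/dev/null; then
    break
  fi
  sleep 1
done
sops -d secrets.enc.yaml   # placeholder file path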
