-
-
Notifications
You must be signed in to change notification settings - Fork 297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handling canceled contexts in Unlock
in custom storage implementations
#247
Comments
I found the underlying cause of the canceled context by now: The background renewal of a not-yet-expired certificate is using the same context as was given to See also #248: The background renewal should use a different context, IMHO it is wrong to assume that it can use the one given to |
Good question! Hmm. Part of me thinks, ignore the context because the point is to cleanup after the context is cancelled. The other part of me thinks, if the context is cancelled, it's important to abort to avoid a goroutine leak (or something). Or maybe we SHOULD make a new context that isn't canceled, but has a timeout, so that unlock can still finish, but the time is set. Is that reasonable? Or do you think the storage implementation should have control of the timeout and set its own when it encounters an already-canceled context?
I think I agree. Hadn't really considered this, but a background routine indeed should not be confined to a single handshake. But why haven't we encountered this problem before? 🤔 (rhetorical question, and mostly for me) |
I think I have seen something like this when reading other golang code, but I cannot remember where and point to that pattern. In a way it sounds reasonable, and I don't think that I-as-storage should be allowed to take ages to cleanup. But then again, I couldn't come up with a good timeout here -- 1s? 10s? 10% of the available timeout? Incidentally, before I forget this yet again: Line 49 in 8fac0d0
ctx := clientHello.Context() . Can file PR if that makes sense, but it has this risk that there's another place where this now would break because we missed something somewhere.
... and that might answer your rhetorical question: Because actually |
Go 1.21 adds |
I somehow missed that Go 1.17 added that. Scanning the code, I think that is a correct solution: the context should be limited to the handshake. If the client aborts the handshake, the server should abort whatever it's doing. UNLESS we start a background maintenance goroutine, in which case we should probably use a different context (done in #248). |
As for the context used for locking... I feel like the most correct thing to do is to pass a fresh new context with our own timeout. Of course, the obvious question is, what should the duration be? I don't know of course... but maybe like 10s? I made a Twitter poll just for fun: https://twitter.com/mholt6/status/1689331972114690048 -- but admittedly it's a niche/obscure question. What do you think, would that work for your deployment? |
Quoting myself a bit:
I don't think there's any good timeout that certmagic would be able to define, and there's also no good sane way to ask for one in the configuration ("unlock-timeout-when-using-a-weird-storage"?), so it's probably the storage side that needs to know what makes sense. The referenced discussion reads like "just stop and abandon properly, the context is now canceled and there's nothing wrong with that" -- and in that case I as storage should actually pass a new context to my functions that do funky I/O. Or in other words: If this would be not a function but inlined, then I would still have to call my I/O-y function, and I would have to pass a new context latest at that point. (Disclaimer: Way out of my depth here, too, let's see what the audience joker gives :D)
In our specific case a 10s timeout should be plenty to basically do a DynamoDB |
Ok, actually, I think your logic makes sense. I'll change our code to pass in a fresh context that is not canceled and has no timeout. If we encounter problems with hanging unlocks, we can enforce a timeout. (Or the implementation just needs to be fixed; either way.) |
* Use context from ClientHello during GetCertificate (see #247) * Avoid recursive ops during on-demand issuance
What is your question?
This relates to this piece of code:
certmagic/config.go
Lines 504 to 517 in 51b3190
Basically: It is possible that locking works, but the context gets subsequently canceled for whatever reason. The deferred
releaseLock
call in line 511 will call our custom storage implementation'sUnlock
with the now-canceled context. Right now that leads to an error in our code, and eventually our lock will expire and we can continue.Is there anything we can do better here? For instance, would it be reasonable to pass another context into
releaseLock
/Unlock
that is not canceled? Or should we "ignore" the cancelation on our side?For reference: Our storage uses DynamoDB, and
Lock
andUnlock
basically modify a specific item in it. We pass the context through to the AWS SDK DynamoDB call, which will recognize it is canceled and not execute the actual request.The text was updated successfully, but these errors were encountered: