-
-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tls: Restructure and improve certificate management #2015
Conversation
- Expose the list of Caddy instances through caddy.Instances() - Added arbitrary storage to caddy.Instance - The cache of loaded certificates is no longer global; now scoped per-instance, meaning upon reload (like SIGUSR1) the old cert cache will be discarded entirely, whereas before, aggressively reloading config that added and removed lots of sites would cause unnecessary build-up in the cache over time. - Key certificates in the cache by their SHA-256 hash instead of by their names. This means certificates will not be duplicated in memory (within each instance), making Caddy much more memory-efficient for large-scale deployments with thousands of sites sharing certs. - Perform name-to-certificate lookups scoped per caddytls.Config instead of a single global lookup. This prevents certificates from stepping on each other when they overlap in their names. - Do not allow TLS configurations keyed by the same hostname to be different; this now throws an error. - Updated relevant tests, with a stark awareness that more tests are needed. - Change the NewContext function signature to include an *Instance. - Strongly recommend (basically require) use of caddytls.NewConfig() to create a new *caddytls.Config, to ensure pointers to the instance certificate cache are initialized properly. - Update the TLS-SNI challenge solver (even though TLS-SNI is disabled currently on the CA side). Store temporary challenge cert in instance cache, but do so directly by the ACME challenge name, not the hash. Modified the getCertificate function to check the cache directly for a name match if one isn't found otherwise. This will allow any caddytls.Config to be able to help solve a TLS-SNI challenge, with one extra side-effect that might actually be kind of interesting (and useless): clients could send a certificate's hash as the SNI and Caddy would be able to serve that certificate for the handshake. - Do not attempt to match a "default" (random) certificate when SNI is present but unrecognized; return no certificate so a TLS alert happens instead. - Store an Instance in the list of instances even while the instance is still starting up (this allows access to the cert cache for performing renewals at startup, etc). Will be removed from list again if instance startup fails. - Laid groundwork for ACMEv2 and Let's Encrypt wildcard support. Server type plugins will need to be updated slightly to accommodate minor adjustments to their API (like passing in an Instance). This commit includes the changes for the HTTP server. Certain Caddyfile configurations might error out with this change, if they configured different TLS settings for the same hostname. This change trades some complexity for other complexity, but ultimately this new complexity is more correct and robust than earlier logic. Fixes #1991 Fixes #1994 Fixes #1303
I had a few questions/ideas as to what this means for Caddy in the future. I'm not really reviewing the code, so I'm sorry if I'm asking things that are painfully obvious by looking at the code.
This means we'll now have support for plugins adding their own storage drivers, correct? E.g. Consul (see #1918) - for situations where Caddy is a load balancer, there needs to be a way to ensure instances don't conflict with eachother when attempting to renew. Which brings me to my next question:
So this is kinda part two of the requirements in #1918, locking. This PR doesn't implement locking for renewals, based on your description, right? For that usecase you described, it sounds like a lock file would be ideal (that could contain a timestamp as well to indicate the last time renewal was attempted) to alert other instances that renewal is already happening, and those other instances could give up. Maybe that's out of scope for this PR though. A lock interface would help for implementing Consul or other storage drivers.
Maybe I'm getting confused by the terminology here, but does I ask that, because:
So does each running instance now know about the other instances on the same machine? If so, neat!
This sounds like a pretty good case for recommending using the new snippet syntax (#1921) to avoid duplicated TLS blocks. |
@francislavoie Great thoughts, let me respond to each one:
No, sorry for the confusion. This storage is simply a
Right: this PR does not totally solve the issue of shared, managed certificates among multiple Caddy instances. However, it should make any problem virtually go away as long as the instances are staggered in their execution by a few seconds to a few minutes, and as long as there aren't too many of them (hundreds or more might be too many to be safe). The TLS storage stuff I am working on in my head should introduce locking and do it properly, but those will be plugins. This solution is baked right into Caddy and should "just work" for the vast majority of people who need something like it.
Each Caddy process instantiates a literal
In the same process, yes. If they have to know about them, they can. The updated certificate maintenance routines run one-per-process and have to iterate the list of Instances in order to access their individual certificate caches. Then they iterate the list of certificates and perform renewals, etc.
Good point. Thanks for your comments! I hope this makes a little more sense! |
Ah, yeah that helps. I got a little excited that this would cover #1918 because there's a few scenarios I ran into that it would really help with, i.e. load balancing your load balancers, or having certs managed by the underlying instances that are being load balanced. That's currently not possible with Caddy because they don't share certs - if one renews, the other won't know about it unless they share the same filesystem. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Without further ado, I will approve this PR.
It is well written and documented!!
Have a wonderful day :)
👍 That also means that incoming requests would be blocked until the TLS renew lock is removed right? |
I wouldn't go too far with the quoted suggestion (no offense Francis, ha) -- what we'd rather have is a truly atomic "create-file-if-not-exists" operation, provided by the OS. Right now, Caddy does not utilize any such features, which is why cert management, even with this PR, is not truly synchronized. And no, incoming requests are NOT blocked while certificate maintenance happens. Maintenance happens in the background and is entirely non-blocking after Caddy starts up. |
Also introduce caddy.OnProcessExit which is a list of functions that run before exiting the process cleanly; these do not count as shutdown callbacks, so they do not return errors and must execute quickly.
# Conflicts: # sigtrap_posix.go
@nodesocket I just pushed some changes which make Caddy synchronize its renewals using the file system, so multiple instances can run at the same time and safely coordinate renewals. You won't have to stagger the execution of Caddy processes anymore, and this should scale well to thousands of instances. Would you please test this and report back? (NOTE: Use the DNS challenge when behind a load balancer.) |
@mholt thanks for this. Awesome. Can you give a brief overview of how it works? I assume still need to use a shared file system like NFS (Amazon Elastic File System) correct? This is not doing any rsync magic is it? |
@nodesocket It should "just work" - yes, it'll use the file system. No rsync; the .caddy folder (well, $CADDYPATH) must be locally mounted. This design may not be 100% perfect -- I'm guessing more like 99.5% or thereabouts, assuming there are no implementation bugs (ha!) -- but it doesn't need to be perfect, since rate limits have a little bit of tolerance. By the way, if your company is relying on Caddy, I highly suggest ordering extended support. If you encounter any bugs, it'll ensure a quicker resolution for your business. 👍 |
@mholt can you provide instructions on how to build this? I assume I need to compile myself? |
Yes :) Just like you've been building it up to now, except on this branch instead of master. (If you haven't been building from source, then you should get a commercial license if you haven't already.) The build instructions are in the README -- just switch to this branch. |
Did some testing between this branch and master, with the following caddyfile
I find that when I visit |
@dschwar Thanks for trying it out! What's the |
After removing the mysql plugin, I tested master vs your branch. On master it loads the provided cert of
The weird thing is, if I browse a site like |
Thanks for giving it another go! We're getting closer. First thing -- and you should do this immediately -- use the Let's Encrypt staging endpoint for testing! Set it with the Are you running caddy as the same user as who owns that directory, etc? Also, is this subdomain new (i.e. does the containing folder listed in the error message already exist on disk)? |
Yup
The containing folder doesn't exist (see output pasted above). Should caddy be trying to provision a LE cert for |
Now we're getting somewhere. So there's 2 things going on: opening a lock file where the parent dir doesn't exist, and the subdomain thing. I'm looking into it... |
@dschwar PS. If you want to help dive into the code and help debug it (since I have to work to reproduce it) -- specifically help with the problem of serving the wrong certificate -- I would really appreciate it. And it would help us get it fixed sooner. You'd want to look in |
@mholt Let's Encrypt staging is still timing-out for me. I can't believe others are not encountering this after 4+ hours. Just curious, are you testing from AWS EC2 instances? I am. |
This seems to happen because I have the command line I tried removing the So I tried changing the |
Nope, not using EC2. Thanks for commenting on their forums -- hopefully they'll find it soon. Not many people use staging except client developers I think. @dschwar -- okay, I think what you'll need to do is simplify your config down to bare-bones. Run caddy without any extra command line args (except to set the staging endpoint!) and then run Caddy again with a simple Caddyfile. I can help debug it but only if I have all the information: how Caddy was run, what env variables are set, the Caddyfile, log output, requests that are made, etc -- all unredacted and unmodified. Otherwise, I can only give you tips to help you debug it on your end. (The |
@mholt I can test this tomorrow but my current hypothesis is that if the ‘-host’ flag is supplied and there is a Caddyfile with a supplied tls cert for that host and a wildcard on-demand config, then when you visit the host, it will choose to generate an on demand cert.
… On Feb 15, 2018, at 12:23 AM, Matt Holt ***@***.***> wrote:
@nodesocket
Let's Encrypt staging is still timing-out for me. I can't believe others are not encountering this after 4+ hours. Just curious, are you testing from AWS EC2 instances? I am.
Nope, not using EC2. Thanks for commenting on their forums -- hopefully they'll find it soon. Not many people use staging except client developers I think.
@dschwar -- okay, I think what you'll need to do is simplify your config down to bare-bones. Run caddy without any extra command line args (except to set the staging endpoint!) and then run Caddy again with a simple Caddyfile. I can help debug it but only if I have all the information: how Caddy was run, what env variables are set, the Caddyfile, log output, requests that are made, etc -- all unredacted and unmodified. Otherwise, I can only give you tips to help you debug it on your end.
(The -host flag, per the docs specifies a default host to listen on; so if the hostname defined in the Caddyfile is empty, it will be filled in with -host. And https://... is not a host, of course. Remember, Caddy is very easy to use, but has enough power that you can easily shoot yourself in the foot. Unlike other web servers, with Caddy, you'll want to trim back the config when things start getting complicated.)
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@dschwar Ahhhh, wait, it's only generating the certificate on-demand (i.e. not at startup)? Is |
@dschwar Ahhhh, wait, it's only generating the certificate on-demand (i.e. not at startup)?
Yup, only on demand. Should have mentioned that earlier, sorry.
Is subdomain.host.com that you visit in your browser the exact same one as the first site in your Caddyfile, subdomain.host.com:443?
Yup. When redacting, I made sure everything matches up and is in the same form.
hostname (:443 for example)
I thought a :443 would give me a wildcard host? So that I can generate certs for any domain that points at my server (hence why I redact, don’t want spam)
… On Feb 15, 2018, at 12:51 AM, Matt Holt ***@***.***> wrote:
@dschwar Ahhhh, wait, it's only generating the certificate on-demand (i.e. not at startup)? Is subdomain.host.com that you visit in your browser the exact same one as the first site in your Caddyfile, subdomain.host.com:443? If so, then what you found may not be a bug with the TLS changes. If you set -host subdomain.host.com and then define a site in the Caddyfile with an empty hostname (:443 for example) then that site is assigned a default hostname of subdomain.host.com according to your flag. Thus there is a duplicate site name and that should be caught at startup. Let me know what you think.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
@dschwar Thanks for clarifying.
Well, the
Won't matter -- all hostnames with Let's Encrypt certificates go to public CT logs anyway. In fact, soon all certificates will need to be submitted to CT. Anyway, I'm thinking of making your situation a "collision" at startup, because otherwise it's not obvious. In other words, using |
See discussion on #2015 for how this situation was discovered. For a Caddyfile like this: localhost { ... } :2015 { ... } Running Caddy like this: caddy -host localhost Produces two sites both defined as `localhost:2015` because the flag changes the default host value to be `localhost`. This should be an error since the sites are not distinct and it is confusing. It can also cause issues with TLS handshakes loading the wrong cert, as the linked discussion shows.
@dschwar I've pushed that fix to this branch; feel free to try it out and let me know. You should now get an error when you try to start Caddy with the configuration you had. The reason you were getting the wrong certificate is because Caddy couldn't disambiguate between the two site configurations because of your |
I think I'm missing something now. I've removed the and the logs show
|
@dschwar Thanks, I think I'm finally onto a fix. Just a hunch, can you verify that replacing the site named |
Seems to do the same thing |
See discussion on #2015; the initial change had removed this check, and I can't remember why I removed it or if it was accidental. Anyway, it's back now.
@dschwar Hmm well that's discouraging. Are you sure? If you could pull the latest and try again -- with and without the |
It works! without the Caddyfile
Thanks so much 🎉 |
@dschwar Great!! It should work with the I'll merge this PR as Go 1.10 is released, unless someone speaks up with another bug report before then. |
Testing that (for fun), when you say it should with
or
I just attempted the first one and it doesn't work for me. |
The first one. That should work, (it works for me), so if you want to investigate feel free! |
It seems to work if I remove the
with the command line args |
Only strip the port from the Location URL value if the port is NOT the HTTPSPort (before, we compared against DefaultHTTPSPort instead of HTTPSPort). The HTTPSPort can be changed, but is done so for port forwarding, since in reality you can't 'change' the standard HTTPS port, you can only forward it.
@dschwar Ah, you know what, I'm sorry -- that is correct and expected behavior; I was wrong before. Wildcard characters ( So actually I think that remaining "bug" is a non-issue. (PS. You can't serve tls on port 80 with Caddy, so you can remove that from your configuration.) However, the investigation led me to fix what I think was some logical inconsistencies in the code, so it was all good. I'm gonna merge this real soon. Probably today. |
Please help test and review this PR!
1. What does this change do, exactly?
Most of the changes are internal to the code, but there are bug fixes and a couple minor breaking changes for server type plugins that use the
caddytls
package.Expose the list of Caddy instances through caddy.Instances()
Added arbitrary storage to caddy.Instance
The cache of loaded certificates is no longer global; now scoped
per-instance, meaning upon reload (like SIGUSR1) the old cert cache
will be discarded entirely, whereas before, aggressively reloading
config that added and removed lots of sites would cause unnecessary
build-up in the cache over time.
Key certificates in the cache by their SHA-256 hash instead of
by their names. This means certificates will not be duplicated in
memory (within each instance), making Caddy much more memory-efficient
for large-scale deployments with thousands of sites sharing certs.
Perform name-to-certificate lookups scoped per caddytls.Config instead
of a single global lookup. This prevents certificates from stepping on
each other when they overlap in their names.
Do not allow TLS configurations keyed by the same hostname to be
different; this now throws an error.
Updated relevant tests, with a stark awareness that more tests are
needed.
Change the NewContext function signature to include an *Instance.
Strongly recommend (basically require) use of caddytls.NewConfig()
to create a new *caddytls.Config, to ensure pointers to the instance
certificate cache are initialized properly.
Update the TLS-SNI challenge solver (even though TLS-SNI is disabled
currently on the CA side). Store temporary challenge cert in instance
cache, but do so directly by the ACME challenge name, not the hash.
Modified the getCertificate function to check the cache directly for
a name match if one isn't found otherwise. This will allow any
caddytls.Config to be able to help solve a TLS-SNI challenge, with one
extra side-effect that might actually be kind of interesting (and
useless): clients could send a certificate's hash as the SNI and
Caddy would be able to serve that certificate for the handshake.
Do not attempt to match a "default" (random) certificate when SNI
is present but unrecognized; return no certificate so a TLS alert
happens instead.
Store an Instance in the list of instances even while the instance
is still starting up (this allows access to the cert cache for
performing renewals at startup, etc). Will be removed from list again
if instance startup fails.
Laid groundwork for ACMEv2 and Let's Encrypt wildcard support.
Server type plugins will need to be updated slightly to accommodate
minor adjustments to their API (like passing in an Instance). This
commit includes the changes for the HTTP server.
Certain Caddyfile configurations might error out with this change, if
they configured different TLS settings for the same hostname.
This change trades some complexity for other complexity, but ultimately
this new complexity is more correct and robust than earlier logic.
2. Please link to the relevant issues.
Fixes #1991
Fixes #1994
Fixes #1303
3. Which documentation changes (if any) need to be made because of this PR?
The
tls
directive page will definitely need updating. TLS configurations for equal hostnames cannot be different. Any description about how certificates are stored internally might need to change.OH, and something I forgot on the original commit message: Caddy can now better support being used in conjunction with other Caddy instances that share the same certificate storage. In other words, if you have multiple instances all using the same $CADDYPATH, that's been a bad idea because they will all try to manage the certificates themselves (keep them renewed, etc, which runs you into rate limits fast). Now, Caddy will attempt to check if the certificate in storage still needs to be renewed just before it fires off a renewal. So if you stagger Caddy deployments by more time than it takes to complete an ACME challenge (usually just a few seconds, but if you use DNS challenge maybe minutes or more), they might play better together in an acceptable way, for small-medium scale use cases. I wouldn't rely on this for high-volume deployments (like, hundreds or thousands of instances sharing the same certificate storage and all trying to manage them) but I bet it'll be better for dozens of Caddy instances!
** UPDATE **: The latest changes I pushed to this branch now eliminate the need to stagger the timing of running each Caddy instance. If multiple Caddy instances (even thousands) share the same certificate storage on disk, they will coordinate renewals and not step on each others' toes. It just requires using the DNS challenge when behind a load balancer. However, testing this is a little tricky, so we'll have to see how it goes. It's an advanced use case, so I'm not expecting many people to be benefitted by it, but I hope they will let us know how it goes!
4. Checklist
/cc @pieterlouw and @miekg -- I'll be getting in touch with you soon with more details for what this means about your server type plugins. I can probably submit a PR to caddy-net, and I'm not sure that CoreDNS will have to change much except a line where
NewContext()
is invoked.