EDIT: It's late and I misread some stuff, so I'm adjusting the text.
The current approach still has the race condition described in #109; it happens not to be triggered only because bootkube insists a lot (up to ~30 minutes if needed). Bootkube can't depend on the DNS records being created, because they are defined outside the Packet Terraform module. However, etcd needs the DNS records to form an etcd cluster, and due to the race condition we currently depend (in the worst case) on the NXDOMAIN cache TTL for that domain being low. In practice this doesn't seem to cause many problems, because bootkube retries for ~30 minutes (as mentioned in the linked issue) and most of the time that is enough. But that is far from ideal and not very reliable, IMHO.
To solve that, why not move the DNS module instantiation inside the controllers module? There might be several ways to do it. One way might be very similar to what is used today, but the DNS module takes a provider (string) input param and uses something like `count = var.provider == "cloudflare" ? 1 : 0` to select the chosen provider. Then bootkube can depend on the DNS module IIUC, and we remove the race condition (the dependency can be added like it is today on AWS, where the DNS code wasn't moved to a dns module and the dependency is explicit, so it doesn't have any race condition).
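One possible shape of that trick, as a minimal sketch (the variable name, resource names, and module layout are illustrative assumptions, not the actual Lokomotive code):

```hcl
# Sketch: a single DNS module receives the provider name and
# instantiates each provider-specific resource conditionally,
# since Terraform providers/modules can't be selected dynamically.
variable "dns_provider" {
  type = string
}

resource "cloudflare_record" "apiserver" {
  count = var.dns_provider == "cloudflare" ? 1 : 0
  # zone_id, name, value, etc.
}

resource "aws_route53_record" "apiserver" {
  count = var.dns_provider == "route53" ? 1 : 0
  # zone_id, name, records, etc.
}
```

Bootkube could then declare an explicit dependency on this module, removing the race condition.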
What do you think? If you consider it out of scope, I'm fine. I wanted to mention because it seems you wanted to do some improvements to the DNS part and I think fixing the race condition is something completely required and with these changes is easier now =)
@rata while I agree this is not ideal, did you ever encounter any issues related to it, or do you know someone who did? I don't recall any issues around it.
This is not trivial, as Terraform providers cannot be provisioned dynamically. If they require credentials and validate them during initialization, then one would have to provide credentials for all providers in order to use just one. I tried that with the AWS provider and it's not worth the effort IMO. The alternative would be to template the Terraform code with Go, but that is not ideal in the current situation either. We discussed various options with @mauriciovasquezbernal and @pothos when we were implementing that.
Nice work @johananl. I like the PR. I added some suggestions in the comments.
Do we want to validate CLOUDFLARE_API_TOKEN in our Go code? If so, where? It seems to me there isn't a good place in the code to do that currently. FWIW, failing to set this env var does produce a meaningful error (by Terraform).
I'm also not sure about that. For the Tinkerbell PR (#382), I decided not to do that and to leave it to Terraform. The trade-off is between UX and maintaining possibly all authentication methods for all providers. The AKS code also does not validate Azure credentials. On the other hand, we still do that for the Packet platform as far as I remember. I created #424 to discuss that. For now, I'd say we shouldn't validate it.
@rata thanks for the feedback. I think this race condition is a valid concern. Here are my thoughts:
That said, I see this issue as unrelated to this PR, so I wouldn't want to block merging because of it. I'd be fine with fixing this problem "on the go" if we had a trivial solution, however this doesn't seem to be the case. FWIW, I can think of several directions for addressing the issue:
No, but I didn't try to break it either. Timing is helping us, and so is the DNS negative caching configuration. If we try to break it, we might break it :)
Ohh, good point. Thanks! Typical Terraform... :-/ I think there are other tricks to do it while keeping the same structure (one provider per folder) as today. I've used such tricks in the past, for example when there was no support for […]. Thanks and sorry for the noise! :)
I've handled all feedback AFAICT. CI is green. Need another round please.
LGTM
Some small comments found by shellcheck. Otherwise it looks good!
LGTM
- Remove the `zone_id` knob. We can easily calculate this value and have no reason to bother the user with it.
- Remove the `aws_creds_path` knob under `dns.route53`. AFAICT the only case where this knob is useful is if I want to store Terraform state in S3 on AWS account A *and* I want to use Route 53 on AWS account B. To me it seems like a rather esoteric use case, and therefore I think it's better to remove this knob for simplicity's sake.
- Convert the `dns.provider` knob to a string. With the `zone_id` and `aws_creds_path` knobs removed, the `route53` block is now empty. Since the `manual` DNS option has no knobs and since `dns.zone` is provider-agnostic anyway, we can use a simple string to designate the selected provider.
- Convert `dns.Validate()` and `dns.AskToConfigure()` to methods on `dns.Config`.
- Run `dns.Validate()` during platform initialization. Running it in the `Apply()` method is too late and makes things break in an unexpected way when passing a wrong DNS provider in the config.
- Move DNS-related code outside of the Packet Terraform module for a better separation of concerns and clearer module boundaries. Instead of calculating the required DNS records inside the Packet module, return the *controller IPs* as an output to be consumed by the various DNS modules. This makes it easier to add new DNS providers and opens the way to supporting multiple DNS providers on platforms other than Packet.
- The "manual" DNS option is now a standalone Terraform module which doesn't create any resources and instead just returns outputs to be consumed by lokoctl.
- Remove the trailing dot from kubeconfig URLs.

Fixes #259.
DNS-related behavior in Lokomotive varies based on the chosen platform, and multiple DNS providers are supported. This document lists the supported DNS providers and describes in detail which records are created.
Although at the moment Cloudflare is supported only with Packet clusters, Cloudflare DNS support isn't strictly tied to Packet. Since the Packet quickstart guide is using Route 53 as the DNS provider, it makes sense to me to add the Cloudflare-related information as a how-to guide for the sake of keeping the quickstart simple and minimal. Putting this information in the configuration reference seemed strange to me since we aren't really providing information about config knobs but are rather giving a "recipe" for achieving a specific result (a cluster with Cloudflare DNS).
The DNS configuration syntax has been changed.
Since we have the same Terraform variables for all DNS modules, we can reduce duplication by symlinking a shared file.
Acronyms should have a consistent case according to the Go style guide.
Ensure DNS records were created at the DNS provider before starting kubelet. This eliminates cases where kubelet attempts to resolve records before they were created, which results in negative DNS caching and in turn breaks the cluster bootstrap process.
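The wait can be sketched as a small shell helper (a sketch only; the function names, retry counts, and use of `dig` are assumptions, not Lokomotive's exact script):

```shell
#!/bin/sh
# Sketch: block until a DNS name resolves, so kubelet never caches
# an NXDOMAIN answer for the cluster's records.

# resolve() is a thin wrapper so the lookup tool can be swapped out.
resolve() {
  dig +short "$1" 2>/dev/null
}

# wait_for_dns NAME [RETRIES] [DELAY_SECONDS]
# Returns 0 once NAME resolves, 1 after exhausting retries.
wait_for_dns() {
  name="$1"
  retries="${2:-60}"
  delay="${3:-5}"
  i=0
  while [ "$i" -lt "$retries" ]; do
    if [ -n "$(resolve "$name")" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

A systemd unit starting kubelet could run such a helper from `ExecStartPre` with the API server domain, making the DNS records an explicit prerequisite.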
Thanks. Forgot about shellcheck 😳 Fixed now.
lgtm
records = list(string)
})
)
provider "aws" {
Hm, I don't think modules should be instantiating the providers.
The module didn't run for me without this since the AWS provider requires a region and I need to instantiate the provider to specify the region AFAIK. Do you have an alternative?
The instantiating code should also instantiate the provider, as far as I know. So this block should be conditionally added together with the module instantiation in Packet's Terraform template, I guess. Something like:
provider "aws" {
...
}
module "dns" {
source = "..."
}
I'm not sure how big a deal that is, though. Perhaps we will find out when porting those changes to AWS, for example.
@invidian +1. The module should use the `providers` meta-argument if a custom one is needed. If not, it seems cleaner, and in line with Terraform best practices, to define it in the top-level Terraform file.
Should we create an issue to track this? :)
Created #539.
General
This PR adds Cloudflare DNS support for Packet clusters.
The motivation behind this PR is to allow users to choose Cloudflare as their DNS provider for Lokomotive. Cloudflare is a very common DNS provider which isn't coupled with an IaaS provider (e.g. AWS/GCP). Supporting Cloudflare makes Lokomotive more flexible and usable by more people.
Fixes #259.
Implementation
While working on adding Cloudflare support I've refactored some aspects of our DNS handling, mostly out of necessity (i.e. I needed to refactor to get Cloudflare in) but partially out of opportunity (i.e. I've tried to simplify the way we deal with DNS while I'm there already).
This PR builds on top of the abstraction done in kinvolk-archives/lokomotive-kubernetes#137. While the changes currently apply only to Packet, this PR makes it relatively easy to extend the DNS abstraction to other platforms, which could enable choosing arbitrary combinations of platform + DNS provider. Extending the DNS abstraction to all platforms is out of scope for this PR.
Simplifying DNS-related Config Knobs
As part of this PR I'm proposing a simplification to the way we deal with DNS configuration. I'm aware this may be a "controversial" change so I'm pointing it out explicitly here.
In the current state, here is what the DNS config looks like for Packet clusters:
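(The original config example appears to have been lost in this page. As a sketch only, reconstructed from the knobs discussed below — the exact block nesting and values are assumptions:)

```hcl
dns {
  zone = "example.com"

  provider {
    route53 {
      zone_id        = "Z0123456789ABC"
      aws_creds_path = "/home/user/.aws/credentials"
    }
    # Alternatively:
    # manual {}
  }
}
```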
I find this structure much more complex than it needs to be. Here is why:

- `dns.route53.zone_id` can be easily calculated. There is no need to bother the user with it.
- `dns.route53.aws_creds_path` is unnecessary because the user can use any of the standard AWS auth mechanisms to control the credentials used for Route 53 provisioning. The only case where that's not true is if the user wants to use S3 for state on account A and Route 53 on account B. To me it seems like a highly specialized case. Furthermore, even if we did want to allow configuring separate creds for S3 and for Route 53, it is not obvious to me that the user's first choice would be to point to a file containing the creds - a more natural choice would be to use env vars IMO.
- `manual` has no knobs. Therefore, it doesn't need to be a block. The new DNS provider added here (`cloudflare`) also doesn't need any knobs besides telling Lokomotive "use Cloudflare".
- `dns.zone` is provider-agnostic: no matter which DNS provider we use, we would have to specify the base domain under which to create a cluster. This is also why it's directly under `dns` and not inside `route53`, for example.

All of the above makes me think we can cover nearly all DNS use cases using two simple knobs:
Even if we insisted on keeping `dns` as a block, the following could do IMO:

Any thoughts are welcome.
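(The concrete examples appear to have been lost in this page. As a sketch of both alternatives, reconstructed from the knobs named above — field names are assumptions, not the final syntax:)

```hcl
# Two simple knobs:
dns_provider = "route53"
dns_zone     = "example.com"

# Or, keeping `dns` as a block:
dns {
  provider = "route53"
  zone     = "example.com"
}
```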
Testing
A Cloudflare account is required to test this PR.
To test this PR, follow the how-to guide added in this PR.
Open Questions

- Do we want to validate `CLOUDFLARE_API_TOKEN` in our Go code? If so, where? It seems to me there isn't a good place in the code to do that currently. FWIW, failing to set this env var does produce a meaningful error (by Terraform).

TODO

- `dig` hack for ARM.