
Packet: Add Cloudflare DNS support #422

Merged: 10 commits merged into master from johananl/packet-cloudflare-dns on May 29, 2020

Conversation

@johananl (Member) commented May 12, 2020

General

This PR adds Cloudflare DNS support for Packet clusters.

The motivation behind this PR is to allow users to choose Cloudflare as their DNS provider for Lokomotive. Cloudflare is a very common DNS provider which isn't coupled with an IaaS provider (e.g. AWS/GCP). Supporting Cloudflare makes Lokomotive more flexible and usable by more people.

Fixes #259.

Implementation

While working on adding Cloudflare support I refactored some aspects of our DNS handling, mostly out of necessity (I needed to refactor to get Cloudflare in) but partially out of opportunity (I tried to simplify the way we deal with DNS while I was there anyway).

This PR builds on top of the abstraction done in kinvolk-archives/lokomotive-kubernetes#137. While the changes currently apply only to Packet, this PR makes it relatively easy to extend the DNS abstraction to other platforms, which could enable choosing arbitrary combinations of platform + DNS provider. Extending the DNS abstraction to all platforms is out of scope for this PR.

Simplifying DNS-related Config Knobs

As part of this PR I'm proposing a simplification to the way we deal with DNS configuration. I'm aware this may be a "controversial" change so I'm pointing it out explicitly here.

In the current state, here is what the DNS config looks like for Packet clusters:

cluster "packet" {
  ...
    dns {
      zone = "example.com"

      provider {
        route53 {
          zone_id        = "xxxxxxxx"
          aws_creds_path = "/tmp/aws_creds"
        }
      }
      # manual {}
    }
  ...
}

I find this structure much more complex than it needs to be. Here is why:

  • dns.route53.zone_id can easily be calculated, so there is no need to bother the user with it (see the Terraform sketch after this list).
  • dns.route53.aws_creds_path is unnecessary because the user can use any of the standard AWS auth mechanisms to control the credentials used for Route 53 provisioning. The only case where that's not true is if the user wants to use S3 for state on account A and Route 53 on account B, which seems like a highly specialized case. Furthermore, even if we did want to allow configuring separate creds for S3 and for Route 53, it is not obvious that the user's first choice would be to point to a file containing the creds; a more natural choice would be env vars, IMO.
  • manual has no knobs, so it doesn't need to be a block. The new DNS provider added here (cloudflare) also doesn't need any knobs beyond telling Lokomotive "use Cloudflare".
  • dns.zone is provider-agnostic: no matter which DNS provider we use, we have to specify the base domain under which to create a cluster. This is also why it sits directly under dns and not inside route53, for example.
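
For illustration, here is a minimal Terraform sketch of why the zone ID is redundant: it can be looked up from the zone name with a standard data source (the zone name here is illustrative):

data "aws_route53_zone" "dns_zone" {
  # Look up the hosted zone by name instead of asking the user for its ID.
  name = "example.com."
}

# The zone ID is then available as data.aws_route53_zone.dns_zone.zone_id
# and can be passed to aws_route53_record resources directly.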

All of the above makes me think we can cover nearly all DNS use cases using two simple knobs:

cluster "packet" {
  ...
  dns_zone     = "example.com"
  dns_provider = "route53" # Alternatively "cloudflare" or "manual"
  ...
}

Even if we insisted on keeping dns as a block, the following would suffice IMO:

cluster "packet" {
  ...
  dns {
    zone     = "example.com"
    provider = "route53"
  }
  ...
}

Any thoughts are welcome.

Testing

A Cloudflare account is required to test this PR.

To test this PR, follow the how-to guide added as part of it.

Open Questions

  • Do we want to validate CLOUDFLARE_API_TOKEN in our Go code? If so, where? It seems to me there isn't a good place in the code to do that currently. FWIW, failing to set this env var does produce a meaningful error (by Terraform).
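
For context, the Cloudflare Terraform provider reads the token from the environment, which is why the provider needs no config knobs at all. A minimal sketch, assuming the provider's standard environment-variable handling:

# The Cloudflare provider picks up CLOUDFLARE_API_TOKEN from the
# environment and fails with a descriptive error if it is missing.
provider "cloudflare" {}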

TODO

  • Open a follow-up issue for extending the DNS abstraction to all supported platforms.
  • Open a follow-up issue for disabling the dig hack on ARM.

@rata (Member) left a comment

EDIT: It is late and I misread some stuff, so I'm adjusting the text.

The current approach still has the race condition described in #109, which happens not to be triggered because bootkube insists a lot (for ~30 min if needed). Bootkube can't depend on the DNS records being created, because they are defined outside the Packet Terraform module. However, etcd needs the DNS records to work in order to form an etcd cluster. Due to the race condition, in the worst case we depend today on the NXDOMAIN caching for that domain being low for this to work. In practice it doesn't seem to cause many problems, because bootkube will retry for ~30 minutes (as mentioned in the linked issue) and most of the time that is enough. But that is far from ideal and not very reliable, IMHO.

To solve that, why not move the DNS module instantiation inside the controllers? There might be several ways to do it; one might be very similar to what is used today, but the DNS module takes a provider (string) input param and uses something like count = var.provider == "cloudflare" ? 1 : 0 to use the chosen provider. Then bootkube can depend on the DNS module, IIUC, and we remove the race condition (the dependency can be added like it is today on AWS here, which wasn't moved to the dns module and is explicit about the dependency, so it doesn't have any race condition).

What do you think? If you consider it out of scope, I'm fine. I wanted to mention it because it seems you wanted to make some improvements to the DNS part, and I think fixing the race condition is something that's completely required; with these changes it's easier now =)
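
For concreteness, a rough sketch of the count-gated pattern described above (all variable names and record attributes are hypothetical):

variable "dns_provider" {
  type = string
}

# Exactly one set of records is created, depending on the chosen provider.
resource "aws_route53_record" "controllers" {
  count   = var.dns_provider == "route53" ? 1 : 0
  zone_id = var.zone_id
  name    = var.cluster_fqdn
  type    = "A"
  ttl     = 300
  records = var.controller_ips
}

resource "cloudflare_record" "controllers" {
  count   = var.dns_provider == "cloudflare" ? 1 : 0
  zone_id = var.zone_id
  name    = var.cluster_fqdn
  type    = "A"
  value   = var.controller_ips[0]
}

Note that, as discussed below, both providers would still have to be configured (and potentially authenticated) for such a module to initialize.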

@invidian (Member):

> But that is far from ideal and not very reliable, IMHO.

@rata while I agree this is not ideal, did you ever encounter issues related to it, or do you know someone who did? I don't recall any issues around it.

> To solve that, why not move the DNS module instantiation inside the controllers? There might be several ways to do it; one might be very similar to what is used today, but the DNS module takes a provider (string) input param and uses something like count = var.provider == "cloudflare" ? 1 : 0 to use the chosen provider. Then bootkube can depend on the DNS module, IIUC, and we remove the race condition (the dependency can be added like it is today on AWS here, which wasn't moved to the dns module and is explicit about the dependency, so it doesn't have any race condition).

This is not trivial, as Terraform providers cannot be dynamically provisioned. If they require credentials and validate them during initialization, then one would have to provide credentials for all providers in order to use just one. I've tried that with the AWS provider and it is not worth the effort IMO.


The alternative would be to template the Terraform code with Go, but that is not ideal in the current situation either. We discussed various options with @mauriciovasquezbernal and @pothos when we were implementing that.

@invidian (Member) left a comment

Nice work @johananl. I like the PR. I added some suggestions in the comments.

> Do we want to validate CLOUDFLARE_API_TOKEN in our Go code? If so, where? It seems to me there isn't a good place in the code to do that currently. FWIW, failing to set this env var does produce a meaningful error (by Terraform).

I'm also not sure about that. For the Tinkerbell PR (#382), I decided not to do that and to let Terraform handle it. The trade-off is between UX and the maintenance of possibly all authentication methods for all providers. The AKS code also does not validate Azure credentials. On the other hand, we still do that for the Packet platform, as far as I remember. I created #424 to discuss that. For now, I would say we shouldn't validate it.

Review threads were opened (and resolved) on: assets/lokomotive-kubernetes/dns/manual/main.tf, assets/lokomotive-kubernetes/dns/route53/main.tf, pkg/dns/dns.go, pkg/platform/packet/packet.go, pkg/platform/packet/template.go, and docs/concepts/dns.md.
@johananl (Member, Author) commented May 13, 2020

@rata thanks for the feedback. I think this race condition is a valid concern. Here are my thoughts:

  • The problem isn't introduced by this PR.
  • AFAICT this is a theoretical issue at the moment. I don't remember ever encountering it on any Lokomotive platform.
  • We don't seem to have a trivial way to fix this in this PR.

That said, I see this issue as unrelated to this PR, so I wouldn't want to block merging because of it. I'd be fine with fixing this problem "on the go" if we had a trivial solution; however, this doesn't seem to be the case.

FWIW, I can think of several directions for addressing the issue:

  • Render Terraform modules using Go and move the DNS creation into the platform module as suggested above.
  • Use -target to manually apply resources as suggested above. IMO, using -target extensively would gradually move us to a "fully imperative" cluster creation process, which would eventually make Terraform useless to us. As long as we want to keep Terraform, I think we should prefer to embrace it rather than fight it, so I'm not in favor of this option.

@johananl force-pushed the johananl/packet-cloudflare-dns branch from b143660 to 87630d6 on May 13, 2020
@rata (Member) commented May 13, 2020

>> But that is far from ideal and not very reliable, IMHO.
>
> @rata while I agree this is not ideal, did you ever encounter issues related to it, or do you know someone who did? I don't recall any issues around it.

No, but I didn't try to break it either. Timing is helping us, and DNS negative caching configuration too. If we try to break it, we might break it :)

> This is not trivial, as Terraform providers cannot be dynamically provisioned. If they require credentials and validate them during initialization, then one would have to provide credentials for all providers in order to use just one. I've tried that with the AWS provider and it is not worth the effort IMO.

Ohh, good point. Thanks! Typical Terraform... :-/

I think there are other tricks to do it while keeping the same structure as today (one provider per folder). I've used such tricks in the past, for example when there was no support for providers. I think maybe it can be simple. But good point; it's not as simple as I thought :-/

Thanks and sorry for the noise! :)

@johananl force-pushed the johananl/packet-cloudflare-dns branch 11 times, most recently from c063455 to 8338d7e, on May 15, 2020
@johananl requested review from invidian and rata on May 15, 2020
@johananl (Member, Author):

I've handled all the feedback, AFAICT, and CI is green. Need another round, please.

@rata removed their request for review on May 15, 2020
invidian previously approved these changes on May 15, 2020:

LGTM

@iaguis (Contributor) left a comment:

Some small comments found by shellcheck. Otherwise it looks good!

invidian previously approved these changes on May 29, 2020:

LGTM

Remove the `zone_id` knob. We can easily calculate this value and
have no reason to bother the user with it.

Remove the `aws_creds_path` knob under `dns.route53`. AFAICT the only
case where this knob is useful is if I want to store Terraform state
in S3 on AWS account A *and* I want to use Route 53 on AWS account B.
To me it seems like a rather esoteric use case and therefore I think
it's better to remove this knob for simplicity's sake.

Convert the dns.provider knob to a string. With the `zone_id` and
`aws_creds_path` knobs removed, the `route53` block is now empty.
Since the `manual` DNS option has no knobs and since `dns.zone` is
provider-agnostic anyway, we can use a simple string to designate the
selected provider.

Convert dns.Validate() and dns.AskToConfigure() to methods on
dns.Config.

Run dns.Validate() during platform initialization. Running it in the
Apply() method is too late and makes things break in an unexpected
way when passing a wrong DNS provider in the config.

Move DNS-related code outside of the Packet Terraform module for a
better separation of concerns and clearer module boundaries. Instead
of calculating the required DNS records inside the Packet module,
return the *controller IPs* as an output to be consumed by the various
DNS modules. This makes it easier to add new DNS providers and opens
the way to supporting multiple DNS providers on platforms other than
Packet. The "manual" DNS option is now a standalone Terraform module
which doesn't create any resources and instead just returns outputs
to be consumed by lokoctl.
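
A simplified sketch of the resulting module boundary (identifiers are illustrative, not the exact code in this PR):

# Packet module: expose the controller IPs instead of creating DNS records.
output "controllers_public_ipv4" {
  value = packet_device.controllers.*.access_public_ipv4
}

# Root module: feed the IPs into whichever DNS module was selected.
module "dns" {
  source                  = "./dns/route53"
  dns_zone                = var.dns_zone
  cluster_name            = var.cluster_name
  controllers_public_ipv4 = module.packet.controllers_public_ipv4
}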

Remove trailing dot from kubeconfig URLs. Fixes #259.

DNS-related behavior in Lokomotive varies based on the chosen
platform. In addition, multiple DNS providers are supported. This
document lists the supported DNS providers and describes in detail
which records are created.

Although at the moment Cloudflare is supported only with Packet
clusters, Cloudflare DNS support isn't strictly tied to Packet.

Since the Packet quickstart guide uses Route 53 as the DNS provider,
it makes sense to me to add the Cloudflare-related information as a
how-to guide, for the sake of keeping the quickstart simple and
minimal. Putting this information in the configuration reference
seemed strange to me, since we aren't really providing information
about config knobs but rather giving a "recipe" for achieving a
specific result (a cluster with Cloudflare DNS).

The DNS configuration syntax has been changed.

Since we have the same Terraform variables for all DNS modules, we
can reduce duplication by symlinking a shared file.

Acronyms should have a consistent case according to the Go style
guide.

Ensure DNS records are created at the DNS provider before starting
kubelet. This eliminates cases where kubelet attempts to resolve
records before they have been created, which results in negative DNS
caching and in turn breaks the cluster bootstrap process.
@johananl force-pushed the johananl/packet-cloudflare-dns branch from bcb29e2 to 8c5b515 on May 29, 2020
@johananl (Member, Author):

> Some small comments found by shellcheck. Otherwise it looks good!

Thanks. I forgot about shellcheck 😳 Fixed now.

@iaguis (Contributor) left a comment:

lgtm

@johananl merged commit 569455a into master on May 29, 2020
@johananl deleted the johananl/packet-cloudflare-dns branch on May 29, 2020
A review thread was opened on the following diff context:

      records = list(string)
    })
  )

provider "aws" {
@invidian (Member):

Hm, I don't think modules should be instantiating the providers.

@johananl (Member, Author):

The module didn't run for me without this, since the AWS provider requires a region and, AFAIK, I need to instantiate the provider to specify the region. Do you have an alternative?

@invidian (Member):

The instantiating code should also instantiate the provider, as far as I know. So this block should be conditionally added, together with the module instantiation, in Packet's Terraform template, I guess. Something like:

provider "aws" {
  ...
}

module "dns" {
  source = "..."
}

I'm not sure how big a deal that is, though. Perhaps we will find out when porting those changes to AWS, for example.

@rata (Member) commented Jun 2, 2020:

@invidian +1. The module should use the providers meta-argument if a custom one is needed. If not, it seems cleaner, and in line with Terraform best practices, to define it in the top-level Terraform file.

Should we create an issue to track this? :)
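
For reference, a minimal sketch of the providers meta-argument mentioned above (aliases and paths are hypothetical):

# Top-level Terraform file: the provider is configured once, outside the module.
provider "aws" {
  region = "us-east-1"
  alias  = "dns"
}

module "dns" {
  source = "./dns/route53"

  # Pass the configured provider into the module explicitly.
  providers = {
    aws = aws.dns
  }
}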

A member replied: Created #539.

johananl added a commit referencing this pull request on Jun 2, 2020:

In #422 the AWS provider was removed from the root Terraform module.
This breaks upgrades from versions which don't include #422.

Merging this pull request closed: "API server hostname in kubeconfig contains a trailing dot" (#259).