
Create control plane without publicly addressable IPs #226

Open
anurag opened this issue Jan 10, 2021 · 10 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments


anurag commented Jan 10, 2021

As an operator I'd like to create clusters with nodes that are completely isolated from the public internet. Instead, they should only be accessible through authorized IPs or bastion nodes.

Detailed Description

In CAPP's current implementation, all control plane and worker nodes have public IP addresses, kubelets are configured to use the control plane's public IP address, and anyone on the internet can attempt connections to control plane nodes. While TLS and authentication add a layer of security, it would be nice to be able to create cluster nodes without public IP addresses, analogous to GKE private clusters. All egress traffic would be directed through NAT gateways, and inbound traffic would only be allowed from authorized IPs.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jan 10, 2021

displague commented Feb 10, 2021

From an Equinix Metal product offering perspective, this is possible in one of two ways.

Every Equinix Metal device has a management address assigned to the first bond by default. This per-device /30 assignment (the default size) is part of a 10.x.x.x/25 address range isolated to device peers within the same EM project and facility (a /56 IPv6 management range is also created). The https://github.com/equinix/terraform-metal-anthos-on-baremetal project used this solution, keeping node communication on the 10.x.x.x network.
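For illustration, the per-device /30 carving out of a /25 management range can be sketched with Python's `ipaddress` module (the 10.64.12.0/25 range here is a made-up example, not a real EM assignment):

```python
import ipaddress

# Hypothetical project/facility management range; real ranges are assigned by EM.
management_range = ipaddress.ip_network("10.64.12.0/25")

# Carve the /25 into per-device /30 assignments, mirroring the default behavior.
device_subnets = list(management_range.subnets(new_prefix=30))

print(len(device_subnets))              # 32 /30 blocks fit in a /25
print(device_subnets[0])                # 10.64.12.0/30
print(list(device_subnets[0].hosts()))  # 2 usable addresses per /30
```

This also shows why the range is per-project/facility: 32 device slots per /25 before another range is needed.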

Alternatively, a device may be provisioned with VLANs. Until recently, all EM devices had to toggle Layer 2 support on a port or bond to enroll in this behavior. This is represented by the layer2-bonded, layer2-individual, and hybrid network modes available in the Terraform provider, with similar names in the EM console UI.

It is now possible to add VLANs to a device without pre-enrolling the device in Layer2 modes, but this new functionality is limited to devices in a subset of facilities (those with the IBX feature flag). We'll have to actively enable layer2 features or detect when this step can be skipped.

It is worth noting that the network modes reported in the UI and represented in the Terraform provider are not settings that can be toggled in the EM API. Network mode is a product of port bondedness, layer-2 or layer-3 state, the presence of management addresses, and the presence of VLANs.

After a few different iterations on how to represent these features in Terraform, we are coming around to the idea of giving the user control over the ports directly without offering network mode hand-waving.

Expressed statefully, and adhering closely to the EM API spec, this could look like this:

- plan: "n2.x-large"
  ports: # network_ports in the API naming
    bond0: # always comprised of ports eth0+eth1 on 2 port devices, or eth0+eth2 on 4 port devices
      bonded: true # default
      layer2: false # this is default on new devices. changes result in /ports/id/convert/layer-[2|3] API calls
    vlans: [1234, 5678] # shortened this from the API "virtual_networks"
      native_vlan: 5678 # optional, must be one of the above if set
      ip_addresses:
      - reservation_id: uuid # reserved addresses by uuid, may include global ips
      - cidr_notation: "1.2.3.4/24" # reserved addresses by address, may include global ips
      - type: private_ipv4 # dynamically assigned addresses, available via metadata.platformequinix.net
        cidr: 30 
    bond1: # comprised of port eth1+eth3 on 4 port devices
      bonded: false
    eth1: # unbonded eth ports can use most of the same attributes bond ports can use
      layer2: true
      vlans: [7654]
    eth3:
      layer2: true
      vlans: [9876]

Users are free to customize ports, bonds, addresses, and VLANs with all the flexibility afforded by the API. Invalid configurations will bubble up through API errors, so input does not have to be validated by the CAPP provider; it is difficult to validate these values because the plan, facility, state of the port, state of the bonds, and other factors affect whether toggling a field succeeds. Through Kubernetes reconciliation loops, eventually all bonds will be connected or disconnected, the device will reach the desired layer-2/layer-3 mode, and the desired addresses will be provisioned and assigned.
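A reconciliation pass of this kind could be sketched as follows (the names, fields, and action strings here are hypothetical stand-ins, not the actual CAPP or EM API shapes):

```python
from dataclasses import dataclass, field

@dataclass
class PortSpec:
    bonded: bool = True     # default on new devices
    layer2: bool = False    # default on new devices
    vlans: list = field(default_factory=list)

def reconcile_port(desired: PortSpec, observed: PortSpec) -> list:
    """Return the API actions needed to move a port toward the desired state.

    No pre-validation: the real API is the arbiter of what is allowed, and an
    error simply leaves work for the next reconciliation pass.
    """
    actions = []
    if observed.bonded != desired.bonded:
        actions.append("bond" if desired.bonded else "disbond")
    if observed.layer2 != desired.layer2:
        actions.append(f"convert/layer-{2 if desired.layer2 else 3}")
    for vlan in desired.vlans:
        if vlan not in observed.vlans:
            actions.append(f"assign vlan {vlan}")
    for vlan in observed.vlans:
        if vlan not in desired.vlans:
            actions.append(f"unassign vlan {vlan}")
    return actions  # an empty list means the port has converged

# One pass: bond0 should become layer2 with VLAN 1234 attached.
print(reconcile_port(PortSpec(layer2=True, vlans=[1234]), PortSpec()))
```

Repeating this until the action list is empty is the loop shape described above.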

A port and bond configuration not supported today may be supported later, or may be supported under a different plan or facility.

This was discussed for consideration in the Terraform provider within a packngo PR: equinixmetal-archive/packngo#239 (comment)

I don't know how well CAPP can capture this approach. CAPP would need to know which address ranges to use for which purposes; dynamically allocated ranges like the project management range could be referred to generically without knowing the precise range, while in other cases the precise range may be known in advance (IP reservations). In VLAN cases, the nodes would need special userdata to configure the network, and CAPP would need to be told which ranges to manage.

Userdata scripts would need to be robust, allowing time for ports to be bonded, VLANs to be attached, and addresses to become available.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 11, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 10, 2021

detiber commented Jun 24, 2021

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. labels Jun 24, 2021
@davidspek (Contributor)

I think this shouldn't be limited to just the control plane, but rather all nodes should only have private IPs and any service that needs to be exposed should be done through a load balancer. This is similar to how EKS, GKE and AKS do things.

@displague (Member)

CPEM added support for clusters with node IP addresses drawn from a VLAN-accessible range in v3.5.0.

CAPP should take advantage of this for clusters that use the Equinix Metal Gateway and VRF features.


Lirt commented Jan 11, 2023

Hi,

I implemented a change in CAPP so that servers don't use public IPs (I can open a PR for that). But the blocker on this feature is that servers without a public IP address don't have Internet connectivity, and there is no NAT Gateway or similar service that would enable servers to reach the Internet.

This can work for use-cases where Internet connectivity is not needed and all dependencies are served from the internal network, but for most standard use-cases the Internet needs to be reachable (at least for kubelet to download container images).

Correct me if I'm wrong, but I don't see how Metal Gateway enables servers to not have public IPs: as we can see in the Metal Gateway example, the servers need to be configured with an IP address at the OS level (which is even less automatable).

One possible way to overcome this is to have a server that acts as a NAT gateway using standard Linux with ip_forward enabled and iptables, but that adds configuration overhead and a single point of failure, and increases costs by one server.
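Such a self-managed NAT node could be bootstrapped with a cloud-config along these lines (a hedged sketch: the interface name `bond0` and the private range 10.64.12.0/25 are assumptions that would need to match the actual deployment):

```yaml
#cloud-config
# Hypothetical NAT gateway bootstrap. bond0 is assumed to be the
# public-facing interface, 10.64.12.0/25 the cluster's private range.
write_files:
  - path: /etc/sysctl.d/99-ipforward.conf
    content: |
      net.ipv4.ip_forward = 1
runcmd:
  - sysctl --system
  - iptables -t nat -A POSTROUTING -s 10.64.12.0/25 -o bond0 -j MASQUERADE
```

This illustrates the overhead mentioned above: the rule and the forwarding setting have to be kept in sync with the cluster's addressing, and the node itself is a single point of failure.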

Is there a plan to have NAT service or something similar to make this easier? Or can you provide future plans around this feature?

If I wrote anything wrong, please correct me and provide me more information.


displague commented Jan 13, 2023

@Lirt This comes down to enabling more networking configuration in ClusterAPI cluster / machine configuration.

For example, nodes could be configured to start-up in a Layer2 Equinix Metal mode (no managed IPs assigned, no Equinix Metal DHCP) and then use cloud-config to assign static IPs.

When these addresses are public, a Metal Gateway can route these nodes to the internet. At first glance, this doesn't seem all that different from using the Equinix Metal provided IP addresses. The benefit is that the VLAN may be one connected to other networks in the Equinix Metal project or even in other clouds (by using Fabric connections). The VLAN may also be one connected to a Network Edge router or a Fabric link to Equinix Internet Access.

For another scenario, let's take advantage of the off-the-shelf public IP connectivity while we configure ClusterAPI nodes to ALSO connect to a VLAN where additional services (alternate K8s environments, or non-K8s services) are running within the Equinix Metal project. We may also want to use this VLAN to connect to Fabric for any of the reasons previously stated.

Moving away from the Public IP clusters, we may want to use CAPP to create a cluster in an existing Equinix Metal Project with an existing VLAN with an existing NAT configuration, possibly with DHCP present. The NAT could be an Equinix Metal service or device (as you pointed out) or it could be Network Edge or it could be a service running somewhere on the VLAN connected through Fabric.

In these scenarios, the common need is for CAPP to:

  • provision an EM node (location, plan, OS)
  • define the network environment that the node should prepare itself for (set the correct network port settings via EM API, attach the device to BGP or VLANs or IP addresses)
  • configure the OS to boot into that networking configuration

What we can expect to be injected are:

  • existing VLANs and/or BGP services
  • existing IPs
  • existing NATs / Gateways
  • existing network accessible Container Registries

CAPP needs to be configurable enough to take advantage of existing resources.

Whether or not CAPP should be responsible for creating and managing any of those API-manageable network primitives is another matter. 😄

@Lirt Your branch / PR sounds very interesting. The simplest way to introduce Layer2 capabilities into CAPP would be to take advantage of Public IP addresses routed through Metal Gateway. This would exercise most of the paths that we would need before exchanging public IPs for private ones and changing the external environment to one where a NAT or private network is utilized along with a private container registry.

Open a draft PR?


Lirt commented Jan 13, 2023

Hi,

Thank you for the explanation. I am still learning about Equinix services and network setup, but now it makes more sense (the Metal Gateway use-case).

My PR just leverages the packngo library (Equinix API) to let the user specify the IP addresses of a device. You can see it opened here.

But I understand that you want to implement full control over ports and ip addresses as you described in your comment (#226 (comment)).

For me, one thing is missing for proper device configuration: the ports API is split from the devices API, and there is no configurable link between them.

Why this missing link is an issue can be seen in a scenario like this:

  1. I can create a device with one API call.
  2. The server will start provisioning and eventually finish.
  3. During or after that, I need to execute multiple separate API calls to convert the network cards to L2/L3/Hybrid modes with specific details.
  4. It is never deterministic what kind of network config the server has before it boots into the operating system, and this can be harmful for automation such as cloud-init, which executes some of its steps/modules only once during first boot.
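One way automation can cope with this non-determinism is to poll the ports API until the observed configuration matches the request before letting first-boot steps proceed. A minimal sketch, where `get_ports` is a stand-in for an EM API call rather than the real packngo client:

```python
import time

def wait_for_ports(get_ports, desired: dict,
                   timeout: float = 600.0, interval: float = 1.0) -> bool:
    """Poll until every port reports the desired network mode, or time out.

    get_ports() is a stand-in returning {port_name: mode}; in practice this
    would be a call to the EM ports API.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        observed = get_ports()
        if all(observed.get(name) == mode for name, mode in desired.items()):
            return True
        time.sleep(interval)
    return False

# Simulated API: the port flips to hybrid after a couple of polls.
states = iter([{"bond0": "layer3"}, {"bond0": "layer3"}, {"bond0": "hybrid"}])
print(wait_for_ports(lambda: next(states), {"bond0": "hybrid"}, interval=0.01))
```

This only narrows the race, though; it does not make the first-boot network state deterministic from the device's point of view.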

@displague (Member)

@Lirt these are valid points.

API clients do not know dynamically assigned IP addresses until after the device has been provisioned, just as it begins to boot. Elastic IPs can be assigned at device creation to get ahead of this.

The MAC address and disk configuration are also unknown until the device instance has been created and begins to provision. These factors can be known ahead of "Instance" creation when using a Hardware Reservation, but On-Demand and Spot instances do not have this advantage.

One way to work around this, for some scenarios, is to manipulate the userdata before it is consumed. This would be a race, but there is a good amount of time between when this information is API discoverable and the device begins to fetch userdata.

Userdata can be manipulated at any time via the EM API, and the updated userdata appears in metadata.platformequinix.com immediately (confirmed). An alternative to replacing static userdata is to use dynamic userdata: by using shell (Python, etc.) scripts in userdata, we can grab these details from the OS and the metadata service. Another alternative is cloud-config Jinja templating, which substitutes template symbols in the cloud-config script with metadata values.
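The Jinja templating option could look roughly like this (a sketch of cloud-init's `## template: jinja` feature; the exact metadata keys exposed on Equinix Metal may differ from the `v1` keys assumed here):

```yaml
## template: jinja
#cloud-config
# Hedged example: cloud-init substitutes the Jinja expressions with values
# from the instance metadata before running the config.
runcmd:
  - echo "provisioned as {{ v1.local_hostname }}" >> /var/log/provision.log
```

The advantage over replacing static userdata is that nothing has to be raced through the API; substitution happens on the device at boot.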

It's important to point out that in full Layer2 modes, the metadata service is not reachable.
For these environments, we need a staged approach:

  • boot into Layer3 or Hybrid
  • configure the Layer2 network settings for the next boot (within the OS)
  • shutdown
  • change the network settings in the EM API
  • boot up

This can also be attempted without shutting down by racing the API changes and OS changes (the OS may need to detect the network changes before activating the new settings).
