Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds PolicySetupTimeoutSeconds option to CalicoNetwork #3186

Merged

Conversation

aaaaaaaalex
Copy link
Contributor

@aaaaaaaalex aaaaaaaalex commented Feb 21, 2024

Sibling PR to be merged first

Description

Adds option to CalicoNetwork to enable a policy-setup timeout in the CNI/Felix.

Felix's new option EndpointStatusPathPrefix gets set to it's /var/run/calico dir to enable the Felix-side feature, where-in it will create a directory "status".

CNI should have its timeout configured according to the CalicoNetwork field, and it's EndpointStatusDir should be set to Felix's /var/run/calico dir plus "/status".

For PR author

  • Tests for change.
  • If changing pkg/apis/, run make gen-files
  • If changing versions, run make gen-versions

For PR reviewers

A note for code reviewers - all pull requests must have the following:

  • Milestone set according to targeted release.
  • Appropriate labels:
    • kind/bug if this is a bugfix.
    • kind/enhancement if this is a a new feature.
    • enterprise if this PR applies to Calico Enterprise only.

@aaaaaaaalex
Copy link
Contributor Author

aaaaaaaalex commented Feb 26, 2024

Going to mark this PR as Ready for Review, although there are still two merge tests that are failing, that I'm a little confused about.

Edit: fixed

@aaaaaaaalex aaaaaaaalex marked this pull request as ready for review February 26, 2024 14:44
@aaaaaaaalex aaaaaaaalex requested a review from a team as a code owner February 26, 2024 14:44
@aaaaaaaalex aaaaaaaalex changed the title Adds PolicyProgrammingTimeoutSeconds option to CalicoNetwork Adds PolicySetupTimeoutSeconds option to CalicoNetwork Feb 27, 2024
@aaaaaaaalex aaaaaaaalex force-pushed the policy-programming-timeout branch 2 times, most recently from b9422d9 to 362c6bc Compare February 27, 2024 14:21
Copy link
Member

@caseydavenport caseydavenport left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good, left a few comments

// PolicySetupTimeoutSeconds delays new pods from going ready
// until their policy has been programmed in the dataplane.
// The specified delay defines the maximum amount of time
// that the Calico CNI will wait for policy to be programmed.
Copy link
Member

@caseydavenport caseydavenport Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the Calico CNI plugin" or just "Calico"

"CNI plugin" refers to a component, and "Calico" refers to the system as a whole.

api/v1/installation_types.go Outdated Show resolved Hide resolved
@@ -349,6 +349,13 @@ func validateCustomResource(instance *operatorv1.Installation) error {
}

}

if instance.Spec.CalicoNetwork.PolicySetupTimeoutSeconds != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need validation here to prevent setting this in other invalid scenarios - e.g., when Calico networking isn't being used, or when Calico in eBPF mode is being used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about Windows? Do we prevent this option in a cluster with Windows nodes? Do we just ignore it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been working off the assumption that we would ignore this option on Windows, how does the operator deal with the issue of windows compat? Does it know if a windows node exists in the cluster?

Copy link
Contributor Author

@aaaaaaaalex aaaaaaaalex Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose I could accept it in the operator, ignore it in felix/cni, if thats an acceptable paradigm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically, we would expect that if a config option isn't supported in a certain configuration, that we return an error and prevent installation of the code in that configuration. We should probably do that for this if Windows is enabled in the configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im unfamiliar with this part of the operator resources, but my first thought was to check if the windows dataplane config option was non-nil. But, that would presumably only tell me that some nodes may be windows, and I still may want the feature on for non-windows nodes 🤔 any guidance appreciated

Copy link
Member

@caseydavenport caseydavenport Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no such thing as a "Windows only" cluster - they all have at least one Linux node since the control plane components don't run on Windows.

The usual ideal we strive for is that if configuration is accepted it means it is going to be implemented fully. We don't really have a mechanism to report back "This configuration is sort of working but not entirely"

I think your options are:

  • Reject the config outright if Windows is enabled.
  • Add support for this feature to Windows.
  • Change the config model so that this is very clearly Linux only (e.g., LinuxPolicySetupTimeoutSeconds)

The last one is the least desirable IMO

pkg/render/node.go Outdated Show resolved Hide resolved
pkg/render/node.go Outdated Show resolved Hide resolved
pkg/render/node_test.go Outdated Show resolved Hide resolved
pkg/controller/installation/core_controller.go Outdated Show resolved Hide resolved
@@ -355,6 +355,9 @@ func validateCustomResource(instance *operatorv1.Installation) error {
if *instance.Spec.CalicoNetwork.PolicySetupTimeoutSeconds < 0 {
return fmt.Errorf("Installation CNI spec.PolicySetupTimeoutSeconds negative value is not valid")
}
if instance.Spec.CalicoNetwork.LinuxDataplane != nil && *instance.Spec.CalicoNetwork.LinuxDataplane != operatorv1.LinuxDataplaneIptables {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we probably want this?

if instance.Spec.CalicoNetwork.LinuxDataplane == nil || *instance.Spec.CalicoNetwork.LinuxDataplane != operatorv1.LinuxDataplaneIptables {

Basically if the data plane is unset, then we don't know if it's OK to enable this feature or not and we need to wait until it is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably also want a check on WindowsDataplane here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the question I have above, I think this feature would still be desirable in a mixed cluster with windows and linux, if its possible to continue when that is the case. Maybe I could error if only a Windows dataplane is defined, and no Linux one?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could error if only a Windows dataplane is defined, and no Linux one?

Is there a reason we haven't implemented this for Windows?

@@ -628,6 +629,10 @@ func (c *nodeComponent) createCalicoPluginConfig() map[string]interface{} {

apiRoot := c.cfg.K8sServiceEp.CNIAPIRoot()

LinuxpolicySetupTimeoutSeconds := int32(0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a private variable, and Policy should be capitalized

linuxPolicySetupTimeoutSeconds

@marvin-tigera
Copy link
Contributor

Removing "merge-when-ready" label due to new commits

@marvin-tigera marvin-tigera merged commit 984399f into tigera:master Mar 1, 2024
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants